(54) Title: A METHOD AND APPARATUS FOR CARRYING OUT EFFICIENTLY ARITHMETIC COMPUTATIONS IN
HARDWARE

(57) Abstract: A method for carrying out modular arithmetic computations involving multiplication operations by utilizing a
non-reduced and extended Montgomery multiplication between a first A and a second B integer values, in which the number of
iterations required is greater than the number of bits n of an odd modulo value N. The method comprises storing n+2 bit values in
an accumulating device (S) capable of adding n+2-bit values (X) to its content, and of dividing its content by 2. Whenever desired,
the content of the accumulating device is set to zero value. At least s (>n+1) iterations of the following steps are performed, while
in each iteration choosing one bit, in sequence, from the value of said first integer value A, starting from its least significant bit:
adding to the content of the accumulating device S the product of the selected bit and said second integer value B; adding to the
resulting content the product of its current least significant bit and N; dividing the result by 2; and obtaining a non-reduced and
extended Montgomery multiplication result by repeating these steps s-1 additional times while in each time using the previous
result (S).
A METHOD AND APPARATUS FOR CARRYING OUT EFFICIENTLY
ARITHMETIC COMPUTATIONS IN HARDWARE 

Field of the Invention 

The present invention relates to the field of fast and efficient implementation of 
modular arithmetics in hardware. More particularly, the invention relates to a 
method and apparatus for carrying out modular arithmetic operations such as 
modular multiplication and exponentiation, utilizing Montgomery and 
straightforward methods. 

Background of the Invention 

The core operations of modern Public Key Cryptosystems (PKC) are typically
based on performing modular arithmetic functions, in particular modular 
exponentiation, where modular exponentiation is essentially based on sequences 
of modular multiplications and modular squares. Consequently, fast methods for 
performing modular arithmetic functions, particularly in hardware, are of great 
importance for practical implementation of PKC. The Montgomery method offers 
an efficient way of carrying out some modular operations, most important of 
which is modular exponentiation. The advantage of this method is mostly 
appreciated in hardware implementations of modular exponentiation. Thus, the 
Montgomery method is widely adopted in implementations of PKCs that 
implement, for example, RSA, Digital Signature Standard (DSS), Diffie-Hellman
(DH) key exchange, and Elliptic Curve Cryptography (ECC) algorithms
("Handbook of Applied Cryptography" by Alfred J. Menezes, Paul C. van
Oorschot and Scott A. Vanstone, CRC Press, October 1996).

Montgomery Multiplication, Definition: Given the n-bit integers A, B, and N
(N > A, B; N is odd), the Montgomery multiplication MMUL(A, B, N, n),
denoted also by MMUL(A, B) (for short), is defined by:

MMUL(A, B) = A*B*2^(-n) mod N

which yields a reduced result, i.e., 0 ≤ MMUL(A, B) < N.

Notations: In the following discussion, the bits of integer values, such as the
n-bit integer A = (A_(n-1), ..., A_1, A_0)_2, are represented utilizing the notation
A_I (0 ≤ I ≤ n-1), wherein the Most Significant Bit (MSB) A_(n-1) is the leftmost bit,
and the Least Significant Bit (LSB) A_0 is the rightmost bit, of the integer value
A. Additionally, the value of a given variable S, in the I-th iteration, is denoted
by S^(I). The notations of modular results, such as A*B mod N, refer to their
reduced value in the range [0, N).

An algorithm for computing Montgomery multiplication (in radix 2) can be 
carried out by the following steps: 

Algorithm 1:

Input: A, B, N, n (Precondition: A, B, N are n-bit integers, satisfying
N > A, B and N is odd)

Output: MMUL(A, B) = A*B*2^(-n) mod N

S = 0
For I from 0 to n-1 do
    1.1 S = S + A_I*B
    1.2 S = S + S_0*N
    1.3 S = S/2
End for
1.4 If S ≥ N Then S = S - N
Return S

The algorithm main loop requires only a series of additions (steps 1.1 and 1.2) 
and divisions by 2 (step 1.3). Step 1.4, called herein the reduction step, is an 
essential step without which the output of the algorithm, S , is not necessarily 
reduced. 
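
As a minimal sketch, Algorithm 1 can be written in Python as follows. The function name mmul and its argument names are illustrative choices that do not appear in the original text, and the final comparison is taken as S ≥ N so that the returned value lies in [0, N).

```python
def mmul(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Radix-2 Montgomery multiplication: a*b*2^(-n_bits) mod n_mod.

    Assumes n_mod is odd and 0 <= a, b < n_mod < 2**n_bits.
    """
    s = 0
    for i in range(n_bits):
        a_i = (a >> i) & 1        # bit A_I, scanned from the LSB upwards
        s += a_i * b              # step 1.1
        s += (s & 1) * n_mod      # step 1.2: adding N makes an odd S even
        s >>= 1                   # step 1.3: exact division by 2
    if s >= n_mod:                # step 1.4: single conditional reduction
        s -= n_mod
    return s


print(mmul(18, 12, 19, 5))        # prints 2
```

The loop uses only additions and one-bit right shifts; the only data-dependent branch is the final comparison of step 1.4.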
Example 1: Table 1 illustrates this process of computing MMUL(A, B) for
A = 18 = (10010)_2, B = 12 = (01100)_2, with N = 19 = (10011)_2. In this example, n = 5
and the Montgomery multiplication is 18*12*2^(-5) mod 19 = 2.

TABLE 1: (Precondition: S = 0, A = 18, B = 12, and N = 19)

I    A_I    S = S + A_I*B    S_0    S = (S + S_0*N)/2
0    0      0                0      0
1    1      12               0      6
2    0      6                0      3
3    0      3                1      11
4    1      23               1      21


Without step 1.4, the output of the algorithm, S, is not necessarily in the range
[0, N). In particular, S may be of more than n bits. Thus, the additional
reduction (S = S - N) (step 1.4) is sometimes required in order to shift the
algorithm's output to the range [0, N). In Example 1 above, the calculation
result is S = 21 > N, and thus the additional reduction S = S - N = 21 - 19 = 2 is
required in this case. In the case where A, B < N, as assumed, it can be shown (by
induction) that before the reduction step (1.4) the result, S, is bounded by N + B.
Thus, in the cases where S ≥ N after the iteration steps 1.1, 1.2, and 1.3, the
additional reduction step 1.4 (S = S - N), which is performed at most once, is
sufficient to reduce the final result to the range [0, N), and therefore to ensure
that the desired result S = A*B*2^(-n) mod N is indeed the output of the
algorithm.
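
The bound can be checked numerically for a small modulus. The sketch below is only an illustration (the helper name mmul_unreduced is hypothetical): it runs the loop of Algorithm 1 without step 1.4 and confirms, for N = 19 and n = 5, that the unreduced value stays below N + B for every pair A, B < N.

```python
def mmul_unreduced(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Loop of Algorithm 1 only (steps 1.1 to 1.3), without the final reduction."""
    s = 0
    for i in range(n_bits):
        s += ((a >> i) & 1) * b   # step 1.1
        s += (s & 1) * n_mod      # step 1.2
        s >>= 1                   # step 1.3
    return s


N, n = 19, 5
# Largest value of S - B over all operand pairs below N.
worst = max(mmul_unreduced(a, b, N, n) - b for a in range(N) for b in range(N))
print(worst < N)                       # prints True
print(mmul_unreduced(18, 12, N, n))    # prints 21, the value reduced in Example 1
```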

This Montgomery multiplication algorithm, which computes MMUL(A, B), can be
used for computing the regular modular multiplication A*B mod N. This can be
carried out in more than one way, as illustrated in the following steps:
Method 1:

Input: A, B, N, A' (A, B, and N are n-bit integers, pre-computed value:
A' = A*2^n mod N)
Output: A*B mod N

T = MMUL(A', B)
Return T

For example, for the case of A = 18, B = 12, N = 19, and n = 5, the auxiliary value
A' = 18*2^5 mod 19 = 6 is pre-computed, and is then used to calculate:
T = MMUL(A', B) = 6*12*2^(-5) mod 19 = 7

Method 2:

Input: A, B, N, A', B' (A, B, and N are n-bit integers, pre-computed values:
A' = A*2^n mod N and B' = B*2^n mod N)
Output: A*B mod N

T = MMUL(A', B')
T = MMUL(T, 1)
Return T

For example, for the case of A = 18, B = 12, N = 19, and n = 5, two auxiliary
values are pre-computed: A' = 18*2^5 mod 19 = 6 and B' = 12*2^5 mod 19 = 4,
which are then used to calculate: T = MMUL(A', B') = 6*4*2^(-5) mod 19 = 15
and finally, the result is computed by:

T = MMUL(T, 1) = 15*1*2^(-5) mod 19 = 7

Method 2 involves the computation of auxiliary values, A' and B'. This
transforms the integers A and B to what is called the "Montgomery base". The
first Montgomery multiplication is applied to the transformed numbers,
resulting in:

T = MMUL(A', B') = A'*B'*2^(-n) mod N = A*B*2^n mod N

This corresponds to the regular modular multiplication in the regular
representation of A and B.

The second Montgomery multiplication (by 1) converts the result back to the
regular base representation. In other words, it removes the redundant 2^n factor
from the above result, T = MMUL(A', B'), thus obtaining the requested result:

T = MMUL(T, 1) = (A*B*2^n)*1*2^(-n) mod N = A*B mod N
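
Both methods can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and mmul is a direct transcription of Algorithm 1 with the reduction taken as S ≥ N.

```python
def mmul(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Algorithm 1: a*b*2^(-n_bits) mod n_mod, for odd n_mod and a, b < n_mod."""
    s = 0
    for i in range(n_bits):
        s += ((a >> i) & 1) * b
        s += (s & 1) * n_mod
        s >>= 1
    return s - n_mod if s >= n_mod else s


def modmul_method1(a: int, b: int, n_mod: int, n_bits: int) -> int:
    a_prime = (a << n_bits) % n_mod             # A' = A*2^n mod N (pre-computed)
    return mmul(a_prime, b, n_mod, n_bits)      # A'*B*2^(-n) = A*B mod N


def modmul_method2(a: int, b: int, n_mod: int, n_bits: int) -> int:
    a_prime = (a << n_bits) % n_mod             # A' = A*2^n mod N
    b_prime = (b << n_bits) % n_mod             # B' = B*2^n mod N
    t = mmul(a_prime, b_prime, n_mod, n_bits)   # A*B*2^n mod N (Montgomery base)
    return mmul(t, 1, n_mod, n_bits)            # multiplication by 1 strips 2^n


print(modmul_method1(18, 12, 19, 5), modmul_method2(18, 12, 19, 5))  # prints 7 7
```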

The overhead involved with Method 1 (computing the auxiliary value) is the 
main reason for which the Montgomery algorithm is not necessarily considered 
useful for computing a single modular multiplication, in comparison with a 
direct approach. However, Method 2 can be used efficiently when several 
modular multiplications are required. After converting the input to the 
Montgomery base, all multiplications are performed by means of the 
Montgomery multiplication algorithm, and the result is converted to the regular 
base at the end of the multiplications sequence. In such cases, the computational 
overhead of Method 2 is negligible, and the Montgomery algorithm substantially 
improves the efficiency in the overall calculations. The most typical example is
the computation of the modular exponent A^E mod N (for an m-bit integer value
exponent E, where with no loss of generality, we assume here that A < N),
utilizing Method 2 and the Montgomery multiplication. The exponentiation
result can be computed, for example, as described hereinbelow (left-to-right
binary exponentiation):



Algorithm 2:
Input: A, E, N
Output: A^E mod N

T^(m-1) = A' = A*2^n mod N
For I from m-2 to 0 do
    2.1 T^(I) = MMUL(T^(I+1), T^(I+1))
    2.2 if E_I = 1 then T^(I) = MMUL(T^(I), A')
End for
2.3 T^(0) = MMUL(T^(0), 1)
Return T^(0)

The computation of the pre-calculated value A' = A*2^n mod N (0 ≤ A' < N)
converts the input to the Montgomery base, the Montgomery multiplications 
and squaring (steps 2.1 and 2.2) correspond to the sequence of multiplications 
and squaring that implement the left-to-right binary exponentiation in the 
regular base, and the Montgomery multiplication by 1 (step 2.3) converts the 
result back to the regular base. Reduction (step 1.4) in intermediate steps, in 
each Montgomery multiplication implemented by algorithm 1, is required in 
order to make sure that the result remains bounded by N . The reduction is of 
vital importance in implementation of such chained algorithms, since it assures 
that the input to the subsequent Montgomery multiplication is properly 
bounded. If reduction is not performed, and the result of one Montgomery 
multiplication (without the reduction step) exceeds N, overflow or erroneous 
results may occur in subsequent steps. 
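
A minimal Python sketch of Algorithm 2 is given below. The names mont_exp and mmul are illustrative, the exponent is assumed to satisfy E ≥ 1 (so that its most significant bit is 1), and every intermediate Montgomery multiplication includes the reduction of step 1.4.

```python
def mmul(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Algorithm 1: a*b*2^(-n_bits) mod n_mod, for odd n_mod and a, b < n_mod."""
    s = 0
    for i in range(n_bits):
        s += ((a >> i) & 1) * b
        s += (s & 1) * n_mod
        s >>= 1
    return s - n_mod if s >= n_mod else s


def mont_exp(a: int, e: int, n_mod: int, n_bits: int) -> int:
    """Left-to-right binary exponentiation a^e mod n_mod, assuming e >= 1 and a < n_mod."""
    m = e.bit_length()
    a_prime = (a << n_bits) % n_mod               # A' = A*2^n mod N
    t = a_prime                                   # T^(m-1) = A'
    for i in range(m - 2, -1, -1):
        t = mmul(t, t, n_mod, n_bits)             # step 2.1: Montgomery square
        if (e >> i) & 1:
            t = mmul(t, a_prime, n_mod, n_bits)   # step 2.2: multiply by A'
    return mmul(t, 1, n_mod, n_bits)              # step 2.3: back to the regular base


print(mont_exp(212, 240, 249, 8))                 # prints 241
```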

The main advantage in using the Montgomery multiplication lies in the 
hardware implementation of this multiplication operation. The MMUL 
algorithm requires, in each step, only the LSB of the accumulating result (step
1.2 above, S = S + S_0*N).
The following example demonstrates an exponentiation operation carried out
utilizing the algorithm described hereinabove. In this example
212^240 mod 249 = 241 is computed.

Example 2: Table 2 illustrates the calculation of A^E mod N, for n-bit values A
and N, and the m-bit value E, utilizing the algorithm hereinabove. In Table 2,
the value obtained in the preceding step, T^(I+1), is followed by the result obtained
in step 2.1, and the result obtained in step 2.2, T^(I). In this example
A = 212, E = 240 = (11110000)_2, and N = 249. Hence, A is of n = 8 bits, E is of
m = 8 bits, and the pre-calculated value required is A' = 212*2^8 mod 249 = 239.

TABLE 2: (Precondition: A = 212, E = 240 = (11110000)_2, N = 249, and
T^(7) = A' = 239)

I    E_I    T^(I+1)    Step 2.1       Step 2.2: T^(I)
6    1      239        370-249=121    254-249=5
5    1      5          217            437-249=188
4    1      188        247            323-249=74
3    0      74         142            142
2    0      142        106            106
1    0      106        289-249=40     40
0    0      40         193            193

And the final result is obtained by computing
T^(0) = MMUL(T^(0), 1) = 193*1*2^(-8) mod 249 = 241.

In this example, the Montgomery multiplication MMUL(A, B) is utilized for the
calculation of Montgomery multiplication, Montgomery square, and
Montgomery multiplication by 1. As was previously discussed, before the
reduction step (1.4), the accumulated result may be greater than N, and
reduction may be required in order to obtain the (correctly reduced) results of
the Montgomery multiplication.

In Example 2, for I = 6, 5, and 4, reduction was required in performing
MMUL(T^(I), A'), and for I = 1 and 6 in performing MMUL(T^(I+1), T^(I+1)).
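
The iterations that needed a reduction can be reproduced with a short trace. The sketch below is illustrative only (mmul_traced is a hypothetical helper): it returns, together with the reduced value, a flag telling whether step 1.4 actually subtracted N, and the first operand is the one whose bits are scanned.

```python
def mmul_traced(a: int, b: int, n_mod: int, n_bits: int):
    """Algorithm 1, returning (result, reduced) where reduced marks a step 1.4 subtraction."""
    s = 0
    for i in range(n_bits):
        s += ((a >> i) & 1) * b
        s += (s & 1) * n_mod
        s >>= 1
    reduced = s >= n_mod
    return (s - n_mod if reduced else s), reduced


N, n, A, E = 249, 8, 212, 240
a_prime = (A << n) % N                 # A' = 239
t = a_prime
reductions = []                        # (I, step) pairs where a subtraction occurred
for i in range(E.bit_length() - 2, -1, -1):
    t, red = mmul_traced(t, t, N, n)             # step 2.1
    if red:
        reductions.append((i, "2.1"))
    if (E >> i) & 1:
        t, red = mmul_traced(t, a_prime, N, n)   # step 2.2
        if red:
            reductions.append((i, "2.2"))
t, _ = mmul_traced(t, 1, N, n)                   # step 2.3

print(reductions)  # prints [(6, '2.1'), (6, '2.2'), (5, '2.2'), (4, '2.2'), (1, '2.1')]
print(t)           # prints 241
```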

It should be noted that the need for reductions substantially complicates 
hardware realizations of such apparatus, particularly when the number of bits n 
is significantly large (e.g., n = 512). Dedicated circuitry is required for detecting
the cases where the result is greater than N, and for performing the appropriate 
subtraction (i.e., the required reduction). 

Efficient implementations of integer multiplication, achieved by indirect
methods that avoid actual multiplication, are known in the literature (e.g., K.
Hwang, Computer Arithmetic: Principles, Architecture, and Design, Wiley, New
York, 1979; Chapter 5). Such methods obtain the multiplication result by means
of successive additions of appropriately pre-chosen quantities. For example, the
value S = S + M*A, where M is m = 2 bits long, can be obtained without
directly computing the product M*A, by using only additions of three pre-stored
quantities, as follows. The quantity to be added to the accumulator depends on
one of the four possible cases M=(0,0), M=(0,1), M=(1,0), M=(1,1):
If M=(0,0), nothing is added to the accumulator S.
If M=(0,1), the value A is added to the accumulator S.
If M=(1,0), the value 2*A is added to the accumulator S.
If M=(1,1), the value 3*A = A+2*A is added to the accumulator S.

Thus, the sum S = S + M * A can be obtained in one operation, by identifying the 
appropriate case (a 1:4 multiplexer in hardware) and adding, accordingly, either 
0, A, 2*A or 3*A to the accumulator. The additional storage of A, 2*A and 3*A 
may be bypassed at the cost of (cumbersome) setting the hardware control
accordingly: adding 2*A may be implemented by shifting the stored value of A 
and then feeding it to the accumulator, and adding 3*A may be implemented by 
adding the value of A and the shifted value of A to the accumulator. 

Consequently, optimizing this operation requires balancing between storage and
speed/hardware requirements. The extra storage of the values A, 2*A, 3*A may
be advantageous if the same operation is repeated many times. For example, the
computation of S = S + K*A, when K is k bits long, can be achieved iteratively.
In each of (1+[k/m]) = (1+[k/2]) iterations, the next m = 2 bits of K are scanned
and define a temporary value of M (m-bit portions of M), with which the above
method is used. The number of bits, m, designates the bit length of those
temporary values (portions of M), and thus also defines the number of right shifts
that should be performed on the addition result S = S + K*A. Analogous
methods use larger values of m, more storage or hardware/control, but a smaller
number (1+[k/m]) of iterations. The same method can be used when the value
M*A + L*B is to be added to the accumulator, in order to compute
S = S + M*A + L*B. In such a case, scanning m bits of M and L in each iteration
yields 2^(2m) combinations for the quantity that is to be added.
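
The scanning scheme for S + K*A can be sketched in Python as follows. This is an illustration under stated assumptions rather than a description of any particular circuit: the function name scan_multiply_add is hypothetical, [k/m] is taken as integer division, and the bits shifted out of the accumulator are collected in a separate variable so that the full result can be reassembled at the end.

```python
def scan_multiply_add(s0: int, k_val: int, a: int, k_bits: int, m: int = 2) -> int:
    """Return s0 + k_val*a using table look-ups, additions and m-bit right shifts."""
    # Pre-stored quantities 0, A, 2A, ..., (2^m - 1)A, built by repeated addition.
    table = [0]
    for _ in range((1 << m) - 1):
        table.append(table[-1] + a)

    mask = (1 << m) - 1
    iterations = 1 + k_bits // m
    acc = s0                  # accumulator S
    low = 0                   # bits already shifted out of the accumulator
    for j in range(iterations):
        digit = (k_val >> (m * j)) & mask    # next m bits of K (the 1:2^m selection)
        acc += table[digit]
        low |= (acc & mask) << (m * j)       # keep the m bits about to be shifted out
        acc >>= m                            # m right shifts
    return (acc << (m * iterations)) | low


print(scan_multiply_add(5, 91, 37, 7))       # prints 3372, i.e. 5 + 91*37
```

A larger m shortens the loop but makes the table grow as 2^m entries, which is the storage versus speed trade-off described above.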

For example, with m = 2, the 2^(2*2) = 16 combinations for the added quantity are: 0,
A, 2*A, 3*A, B, 2*B, 3*B, A+B, A+2*B, A+3*B, 2*A+B, 2*(A+B),
2*A+3*B, 3*A+B, 3*A+2*B, 3*(A+B). Storage of 15 quantities is needed
unless extra hardware/control is used for adding 2*(A+B) and/or adding 3*(A+B)
by using the stored value of (A+B). For m = 1, there are 2^(2*1) = 4 combinations,
namely: 0, A, B, A+B. The case m = 1 is illustrated in Fig. 1 for carrying out
multiplication and summation operations of four integers, A, B, C, and D. The
apparatus depicted in Fig. 1 utilizes three registers R0, R1, and R2, a 1:4
multiplexer (MUX), and a Carry Save Adder (CSA), to carry out the calculation
of A*B + C*D + G. The registers R0 and R2 are n bits each, while register R1 is
of n+1 bits. Each of the registers, R0, R1, and R2, is connected to one of the MUX's
inputs, In2, In3, and In1, respectively, while the MUX's input In0 is constantly
fed by a "0" value (an n-bit value).

The multiplexer MUX has two control inputs, C0 and C1, such that for each state
of the control inputs, C0 and C1, a corresponding input is selected, and output on
the MUX's output (out). The calculation of A*B + C*D + G is carried out by
loading registers R0, R1, R2, and the CSA with the values of D, B+D, B, and G,
respectively, and serially feeding the data bits of A and C
(A_I and C_I (I = 0,1,2,..., n-1)), through the MUX's control inputs, C0 and C1
respectively.

The CSA is of n+2 bits, to allow overflow of 2 bits, and it is utilized for adding
the value of the selected input (In0, In1, In2, or In3), retrieved via the MUX's
output out, to its present content. The result of this addition is stored in the
CSA, which is then subject to a right shift performed on the CSA content.
Shifting the bits of an even binary value to the right is equivalent to the division
of that value by 2 (as in step 1.3 above). Thus, in each cycle in the operation of this
system, the following operations are performed:

1) selection of the respective value on In0, In1, In2, and In3;

2) addition of the selected value with the current content of the CSA
register; and

3) right shifting the CSA bits, which also introduces the LSB of the CSA
(i.e., CSA_0) on the CSA_0 output.

To implement Steps 1 and 2, the bits of A and C, A_I and C_I (I = 0,1,2,..., n-1),
are serially introduced on the MUX's control inputs, C0 and C1, starting with
the LSBs. Consequently, the MUX's output out^(I) may take any of the following
values in each and every iteration I:
out^(I) =
    0        if A_I = C_I = 0
    B        if A_I = 1 and C_I = 0
    D        if A_I = 0 and C_I = 1
    B + D    if A_I = C_I = 1

The process of calculating A*B + C*D + G is further described by the following
pseudocode.

D -> R0, B+D -> R1, B -> R2, G -> CSA
For I from 0 to n-1 Do
    CSA^(I+1) = (CSA^(I) + out^(I))/2
End For

After n iterations the CSA's content (CSA^(n)) holds the n+1 Most Significant
Bits (MSB) of the calculated result, and another n LSBs of the calculated result
are obtained on the CSA_0 output during the iterations. The CSA's content may
be output utilizing a parallel output bus (not illustrated), or alternatively, by
resetting the MUX's control inputs (i.e., setting C0=C1=0), and performing n+1
additional iterations, to output the n+1 MSBs of the result on the CSA_0 output
(serial approach). The main drawback of the serial approach is that it is
time-consuming (the addition of n+1 cycles is required to obtain the CSA content). On
the other hand, although performance is significantly improved utilizing the
parallel approach, it is considered costly in terms of hardware means.
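
The data path described above can be modelled behaviourally in a few lines of Python. The sketch is an illustration only: the function name mac2 is hypothetical, the MUX input is assumed to be selected by the index 2*C1 + C0 (which matches the four cases listed for out^(I)), and the carry-save structure of the adder is not modelled, only its arithmetic effect.

```python
def mac2(a: int, b: int, c: int, d: int, g: int, n_bits: int) -> int:
    """Behavioural model of the Fig. 1 data path computing a*b + c*d + g."""
    r0, r1, r2 = d, b + d, b             # D -> R0, B+D -> R1, B -> R2
    mux_inputs = [0, r2, r0, r1]         # In0, In1, In2, In3
    csa = g                              # G -> CSA
    serial_out = 0                       # bits observed on the CSA_0 output
    for i in range(n_bits):
        c0 = (a >> i) & 1                # A_I drives control input C0
        c1 = (c >> i) & 1                # C_I drives control input C1
        csa += mux_inputs[2 * c1 + c0]   # add the selected value
        serial_out |= (csa & 1) << i     # the LSB leaves on CSA_0
        csa >>= 1                        # right shift
    # The CSA holds the upper bits; the n low bits were output serially.
    return (csa << n_bits) | serial_out


print(mac2(18, 12, 7, 9, 3, 5))          # prints 282, i.e. 18*12 + 7*9 + 3
```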

This apparatus is efficiently utilized to perform Montgomery multiplication by
applying the Montgomery method, as described in Patent Application WO
98/50851 and US 6,185,596. In those Patent Applications a precomputed
constant (J = -N^(-1) mod 2^n) is utilized to calculate in each iteration the number
of times, Y = (A*B*J) mod 2^n, that modulus N should be added to the
multiplication of A*B. This method requires testing, after each iteration of the
Montgomery process, if the addition result exceeds the modulus value N. In such
cases, the result does not exceed 2*N. Consequently, dedicated hardware is
utilized in those implementations for testing the result in each iteration, and for
subtracting the modulus value N from the result, whenever it exceeds the
modulus value.

Methods for implementing modular multiplication by using the Montgomery 
multiplication as known in the art, are mainly affected - in both time and 
hardware - by the need to reduce the output resulting values, to values which 
are smaller than N. Furthermore, the reduction step, being dependent on the 
specific input (via the "if" statement), makes this implementation susceptible to
(side channel) attacks. Therefore, although the Montgomery multiplication
method enables efficient hardware implementation of modular arithmetic 
operations, such as modular exponentiation, there is a need for improving the 
hardware implementations of such operations. This may be achieved utilizing a 
method and an apparatus that does not require repeated reduction after each 
Montgomery multiplication. 

It is an object of the present invention to provide a method and apparatus for 
carrying out a modified version of Montgomery multiplication in which the 
intermediate and the final calculation results do not exceed known bounds, and 
wherein no reduction is required during a chained sequence of such modified 
Montgomery multiplication, such as the sequence required for an exponentiation 
process, and the final result of the exponentiation process, is automatically 
reduced (between 0 and N). 

It is another object of the present invention to provide a method and apparatus 
(called also a PKI apparatus herein) allowing efficient hardware 
implementations of modular exponentiation, and other modular arithmetic 
operations, based or not based on the Montgomery multiplication, which include 
the basic operations required for hardware implementation of public key 
cryptosystems.
It is yet another object of the present invention to provide a method and
apparatus allowing efficient hardware implementations of various modular
exponentiation algorithms such as right-to-left, left-to-right, m-ary, and
sliding-window exponentiation algorithms.

It is a still further object of the present invention to provide a method and 
apparatus for a secure PKI apparatus, based on a non-reduced and modified 
Montgomery multiplication, which is proof against timing attacks. 

Summary of the Invention 

In one aspect the present invention is directed to a method for carrying out 
modular arithmetic computations involving multiplication operations by 
utilizing a non-reduced and extended Montgomery multiplication between a 
first A and a second B integer values, in which the number of iterations 
required is greater than the number of bits n of an odd modulo value N, the 
method comprising: 

a) providing an accumulating device (S) capable of storing n+2 bit values, of
adding n+2-bit values (X) to its content (S + X -> S), and of dividing its
content by 2 (S/2 -> S);

b) whenever desired, setting the content of the device to a zero value
("0" -> S) and performing in the device at least s (>n+1) iterations, while in
each iteration choosing one bit, in sequence, from the value of the first
integer value A (A_I; 0 ≤ I ≤ s-1), starting from its least significant bit
(A_0):

b.1) adding to the content of the device S the product of the selected bit
A_I and the second integer value B (S + A_I*B -> S);

b.2) adding to the resulting content of the device the product of its
current least significant bit S_0 and N (S + S_0*N -> S);

b.3) dividing the resulting content of the device by 2 (S/2 -> S); and

b.4) obtaining a non-reduced and extended Montgomery multiplication
result by repeating steps b.1) to b.3) s-1 additional times while in
each time using the previous result (S).
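
A minimal Python sketch of steps b.1) to b.4) follows, assuming s = n+2 iterations; the name ext_mmul is illustrative and does not appear in the original text. The returned value is congruent to A*B*2^(-s) mod N but is deliberately left unreduced. For the small modulus used here, an exhaustive check also shows that operands below 2N give a result below 2N, so the output can be fed to a further multiplication without a subtraction.

```python
def ext_mmul(a: int, b: int, n_mod: int, s: int) -> int:
    """Non-reduced, extended Montgomery multiplication with s iterations."""
    acc = 0                               # "0" -> S
    for i in range(s):
        acc += ((a >> i) & 1) * b         # b.1: S + A_I*B -> S
        acc += (acc & 1) * n_mod          # b.2: S + S_0*N -> S
        acc >>= 1                         # b.3: S/2 -> S
    return acc                            # no final comparison or subtraction


N, n = 19, 5
s = n + 2
r = ext_mmul(18, 12, N, s)
print(r)                                  # prints 10
print((r * 2**s - 18 * 12) % N == 0)      # prints True: r = A*B*2^(-s) mod N
print(all(ext_mmul(a, b, N, s) < 2 * N
          for a in range(2 * N) for b in range(2 * N)))   # prints True
```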



The Montgomery multiplication result can be obtained by unifying steps b.1) to
b.3) into a single step, by providing a first storing device (R2) for storing the
modulo value N, a second storing device (R0) for storing the value of the second
integer B, a third storing device (R1) for storing the sum of the modulo N and
the second integer value B, providing an arbitration circuitry having a first
(In1), second (In2) and third (In3) inputs from the first (R2), second (R0) and
third (R1) storage devices respectively, and having an additional zero input
(In0), the arbitration device receives a first (C1) and a second (C0) control
inputs, and is capable of selecting one of its other inputs as its output, such that:

whenever its first (C1) and second (C0) control inputs are zero, selecting
the additional zero input (In0);

whenever its first control input (C1) is one and its second control input
(C0) is zero, selecting its second input (In2);

whenever its first control input (C1) is zero and its second control input
(C0) is one, selecting its first input (In1); and

whenever its first (C1) and second (C0) control inputs are one, selecting
the third input (In3);

wherein the selected input is provided as the output of the arbitration circuitry
which is attached to the input of the accumulating device. The computation is
carried out by applying the bits of the first integer value A (A_I; 0 ≤ I < s), one
by one, in sequence, starting from its least significant bit (A_0), to the first
control input (C1), and providing circuitry for producing the state (K_I) of the
second control input (C0) according to the state of the selected bit of the first
integer value (A_I), the state of the least significant bit of the second integer
value (B_0), and according to the state of the least significant bit of the
accumulating device (S_0).
The state (K_I) of the second control input (C0) can be produced by producing a
value of one (K_I = "1") whenever the state of the first control input (C1) and the
state of the least significant bit of the second integer value (B_0) are one, and the
state of the least significant bit of the accumulating device (S_0) is zero, or when
the state of the first control input (C1) and the state of the least significant bit
(B_0) of the second integer value B are in different states, and the state of the
least significant bit (S_0) of the accumulating device is one; otherwise a zero
value (K_I = "0") is produced as the state (K_I) of the second control input (C0).

The state of the second control input (C0) can be produced by circuitry
comprising a logical AND gate and a logical XOR gate, where the inputs of the
logical AND gate receive the state of the first control input (C1) and the
state of the least significant bit (B_0) of the second integer value B, where
the inputs of the logical XOR gate receive the output from the logical AND
gate and the state of the least significant bit of the accumulating device (S_0),
and where the output of the logical XOR gate is utilized as the state of the
second control input (C0).
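
The unified step can be sketched in Python as follows. This is a behavioural illustration with hypothetical names: the MUX is assumed to be indexed by 2*C1 + C0, and the AND/XOR gate pair is written as K_I = (A_I AND B_0) XOR S_0, which equals the least significant bit of S + A_I*B. Because N is odd, adding K_I*N always makes the sum even, so a single addition of 0, N, B or N+B followed by a right shift replaces steps b.1) to b.3).

```python
def ext_mmul_mux(a: int, b: int, n_mod: int, s: int) -> int:
    """Extended Montgomery multiplication with one MUX-selected addition per iteration."""
    # In0 = 0, In1 = R2 = N, In2 = R0 = B, In3 = R1 = N + B
    mux_inputs = [0, n_mod, b, n_mod + b]
    b0 = b & 1                            # least significant bit of B
    acc = 0                               # accumulating device S
    for i in range(s):
        c1 = (a >> i) & 1                 # first control input: bit A_I
        c0 = (c1 & b0) ^ (acc & 1)        # second control input K_I (AND gate, then XOR gate)
        acc = (acc + mux_inputs[2 * c1 + c0]) >> 1   # single addition, then divide by 2
    return acc


print(ext_mmul_mux(18, 12, 19, 7))        # prints 10
```

In this form the gate pair also yields K_I = 1 when A_I and B_0 are both zero and S_0 is one, since the AND output is then zero.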

Preferably, the number of iterations s utilized for carrying out the Montgomery 
multiplication is n+2, thereby an extended Montgomery multiplication result is 
obtained, in which n+2 iterations are performed. 

The method may further comprise allowing modular arithmetic operations to be
carried out, by utilizing for the first (R2), second (R0), and third (R1) storage
devices n+2 bit shift registers having a serial input into their most
significant bit locations, and which may be capable of outputting their content
in parallel, providing the first storage device (R2) with a serial output, from its
least significant bit location (R2_0), and allowing it to perform cyclic bit rotation,
allowing the second storage device (R0) to receive on its serial input the least
significant bit (S_0) of the accumulating device, providing a fourth storage device
(R3) capable of serially outputting its content, bit by bit in sequence
(R3_I; I = 0,1,2,..., n+1), starting from its least significant bit (R3_0), the fourth
storage device is capable of storing n+2 bits, and of performing cyclic bit rotation
of its content, providing a fifth storage device (R4) having a serial input and a
serial output, and which is capable of storing values of n+2 bits, providing a
sixth storage device (R5) capable of serially outputting its content, bit by bit in
sequence (R5_I; I = 0,1,2,..., n+1), starting from its least significant bit, the fourth
storage device is capable of storing n+2 bits, providing a first arbitration device
(MX1) having a first input from the fifth storage device (R4_I), and a second
input from the circuitry producing the state of the second control input (K_I), the
output of the first arbitration device is attached to the second control input (C0),
providing a second arbitration device (MX2) having a first input being equal to
the least significant bit of the accumulating device (S_0, also referred to herein
as CSA_0), a second input received from the output of the circuitry (K_I), and a
third input connected to the serial output (R4_I) of the fifth storage device (R4),
the output of the second arbitration device is attached to the serial input of the
fifth storage device (R4), providing a third arbitration device (MX3) having a
first input which is constantly fed with a zero value ("0"), and a second input
received from the serial output of the fifth storage device (R4_I), the output of
the third arbitration device is connected to a serial input of the accumulating
device, providing a fourth arbitration device (MX4) having a first input
connected to the serial output of the sixth storage device (R5_I), and a second
input connected to the serial output of the fourth storage device (R3_I), the
output of the fourth arbitration device is connected to the first control input
(C1), and providing an adder capable of performing serial addition of n+2 bit
values, the adder receives a first input from the least significant bit location of
the accumulating device (S_0), and a second input from the serial output of the
first storage device (R2), the output of the adder is connected to the serial input
of the third storage device (R1).

Preferably, the accumulating device consists of n+2 addition and latching stages,
each of which consists of a first and a second flip flop devices and a full adder
device having three inputs, except for the first stage wherein the second flip
flop is excluded. In each addition and latching stage the first input of the full
adder is connected to the output of a first flip flop device, the second input of the
full adder is connected to the output of a second flip flop device of the
subsequent addition and latching stage; and the third input of the full adder is
connected to the respective bit output of the arbitration device
(MUX_I; 0 ≤ I ≤ n+1).

The method may further comprise adding the output from the third arbitration
device (MX3), via the serial input of the accumulating device, to the addition
result of the (n+1)-th addition and latching stage by providing the (n+1)-th
addition and latching stage with a first and second half adder devices, and a
third flip flop device, connecting the input of the first flip flop device to the sum
output of the second half adder, connecting the input of the second flip flop
device to the carry output of the second half adder, and connecting the output of
the flip flop device to the second input of the full adder of the (n+2)-th addition
and latching stage, connecting the first input of the second half adder to the
carry output of the full adder of the (n+1)-th addition and latching stage, and its
second input to the carry output of the first half adder, connecting the first
input of the first half adder to the sum output of the full adder, and connecting
the second input of the second half adder to the output of the third arbitration
device (MX3); and connecting the input of the third flip flop device to the sum
output of the first half adder, and connecting its output to the second input of the
full adder of the (n-1)-th addition and latching stage.
The state of the second control input (C0) can be determined utilizing the least
significant bit of the second storage device (R0), the output of the fourth
arbitration device (MX4), the carry output of the full adder of the first addition
and latching stage, and the sum output of the full adder of the second addition
and latching stage. Preferably it is carried out by connecting the least
significant bit of the second storage device (R0) and the output of the fourth
arbitration device (MX4) to the inputs of an AND logical gate, providing an
additional half adder and an additional flip flop device, connecting the first
input of the half adder to the sum output of the full adder of the second addition
and latching stage, and its second input to the carry output of the full adder of
the first addition and latching stage, connecting the sum output of the half
adder to the input of the additional flip flop device, and connecting the output of
the AND logical gate and the output of the flip flop device to the inputs of a XOR
gate, and utilizing the output of the XOR gate to determine the state of the
second control input (C0).

The method may further comprise carrying out non-reduced Montgomery 
squaring of an integer value B, by loading the first (R2), second (R0), and third 
(R1) storage devices with the values of the modulus N, the integer B, and the 
sum of the modulus and the integer (N+B), respectively; setting the first (MX1), 
second (MX2), third (MX3), and fourth (MX4) arbitration devices to select the 
inputs of the circuitry for producing the state (K_j) of the second control input 
(C0), the circuitry for producing the state (K_j) of the second control input (C0), 
the zero value ("0"), and the output of the sixth storage device (R5), respectively; 
loading the content of the sixth storage device (R5) with the content of the 
second storage device (R0), and loading the content of the accumulating device 
with a zero value; performing the non-reduced and extended Montgomery 
multiplication, wherein the content of the sixth storage device (R5) is shifted by 
one bit to the right in each cycle; and obtaining the non-reduced Montgomery 
squaring result in the accumulating device. 

The method may also comprise carrying out Montgomery multiplication of a 
first (A) and second (B) integer values, by loading the first (R2), second (R0), 
third (R1), and fourth (R3) storage devices with the values of the modulus N, 
the second integer (B), the sum of the modulus and the second integer (N+B), 
and the first integer (A), respectively; setting the first (MX1), second (MX2), 
third (MX3), and fourth (MX4) arbitration devices to select the inputs of the 
circuitry for producing the state (K_j) of the second control input (C0), the 
circuitry for producing the state (K_j) of the second control input (C0), the zero 
value ("0"), and the output of the fourth storage device (R3), respectively; 
loading the content of the accumulating device with a zero value; performing the 
non-reduced and extended Montgomery multiplication, wherein the content of 
the fourth storage device (R3) is shifted by one bit to the right in each cycle; and 
obtaining the non-reduced Montgomery multiplication result in the 
accumulating device. 

The computation of the modular exponentiation A^E mod N can be carried out by 
pre-calculating an adjusted operand value A' = A*2^s mod N; composing an 
adjusted value for the exponent E = (e_(m-1), e_(m-2), ..., e_1, e_0)_2 by reversing its bit 
order and eliminating the most significant bit e_(m-1), to obtain the adjusted value 
E' = (e_0, e_1, ..., e_(m-2))_2; loading the content of the first, second, third, and fifth 
storage devices with the values of the modulus N, the adjusted operand (A'), the 
sum of the modulus and the adjusted operand (N + A'), and the adjusted 
exponent value E', respectively; obtaining the bit length m of the exponent 
value E; and performing the following steps m-1 times: 

- right shifting the content of the fifth storage device (R4); 

- performing non-reduced Montgomery squaring to obtain the non-reduced 
Montgomery square of the content of the second storage device (R0) in the 
accumulating device; 

- loading the content of the second storage device (R0) with the content of the 
accumulating device; 

- loading the content of the third storage device (R1) with the sum of the 
content of the first storage device (R2) and the content of the 
accumulating device; and 

- if the least significant bit (R4_0) of the fifth storage device equals "1", performing 
non-reduced and extended Montgomery multiplication to obtain the non-reduced 
Montgomery multiplication result of the contents of the second storage device 
(R0) and the fourth storage device (R3) in the accumulating device, loading the 
content of the second storage device (R0) with the content of the accumulating 
device, and loading the content of the third storage device (R1) with the sum of 
the contents of the first storage device (R2) and the accumulating device. 

After repeating these steps m-1 times, the modular exponentiation result is 
obtained by performing non-reduced and extended Montgomery multiplication 
of the content of the second storage device (R0) by 1 to obtain the final reduced 
result in the accumulating device. 
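By way of non-limiting illustration only, the exponentiation sequence described 
above can be sketched in software. The sketch below is an assumption-laden 
model and not the hardware flow itself: the function names nrmm and 
modexp_nrmm are hypothetical, the registers are replaced by ordinary integers, 
and the exponent bits e_(m-2), ..., e_0 are read directly from E instead of from 
a bit-reversed copy E' held in R4. 

```python
def nrmm(a: int, b: int, n_mod: int, s: int) -> int:
    """Non-reduced, extended Montgomery product, congruent to a*b*2^(-s) mod n_mod."""
    acc = 0
    for i in range(s):
        acc += ((a >> i) & 1) * b      # add B when the selected bit of A is 1
        acc += (acc & 1) * n_mod       # add N when the accumulator is odd
        acc >>= 1                      # exact division by 2
    return acc


def modexp_nrmm(a: int, e: int, n_mod: int) -> int:
    """Left-to-right square-and-multiply built only from nrmm calls (e >= 1)."""
    s = n_mod.bit_length() + 2         # s = n + 2 iterations
    a_adj = (a << s) % n_mod           # adjusted operand A' = A * 2^s mod N
    t = a_adj
    m = e.bit_length()
    for i in range(m - 2, -1, -1):     # bits e_(m-2) down to e_0
        t = nrmm(t, t, n_mod, s)       # non-reduced Montgomery squaring
        if (e >> i) & 1:
            t = nrmm(t, a_adj, n_mod, s)
    return nrmm(t, 1, n_mod, s)        # multiplication by 1 yields the reduced result


print(modexp_nrmm(212, 240, 249))      # 241
```

With the values of Example 3 (A = 212, E = 240, N = 249) the sketch returns 241, 
which agrees with Python's built-in pow(212, 240, 249). 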

Alternatively, the modular exponentiation A^E mod N can be computed by pre- 
calculating the adjusted operand value A' = A*2^s mod N; loading the content of 
the first (R2), second (R0), third (R1), and fifth (R4) storage devices with the 
values of the modulus N, the adjusted operand (A'), the sum of the modulus and 
the adjusted operand (N + A'), and the exponent value E, respectively; obtaining 
the bit length m of the exponent value E; setting a flag to "1"; and performing the 
following steps m-2 times: 

- right shifting the content of the fifth storage device (R4); 

- if the least significant bit (R4_0) of the fifth storage device equals "1", 
checking the state of the flag, and if it does not equal "1", performing non- 
reduced and extended Montgomery multiplication to obtain the non-reduced and 
extended Montgomery multiplication result of the contents of the second storage 
device (R0) and the fourth storage device (R3) in the accumulating device, and 
loading the content of the fourth storage device (R3) with the content of the 
accumulating device; otherwise loading the content of the fourth storage device 
(R3) with the content of the second storage device (R0) and resetting the state of 
the flag to "0"; 

- performing extended and non-reduced Montgomery squaring to obtain 
the extended and non-reduced Montgomery square of the content of the second 
storage device (R0) in the accumulating device; 

- loading the content of the second storage device (R0) with the content of 
the accumulating device; and 

- loading the content of the third storage device (R1) with the sum of the 
content of the first storage device and the content of the accumulating device. 

After performing these steps m-2 times, performing extended and non-reduced 
Montgomery multiplication to obtain the extended and non-reduced 
Montgomery multiplication result of the contents of the second storage device 
(R0) and the fourth storage device (R3) in the accumulating device; loading the 
content of the second storage device (R0) with the content of the accumulating 
device; loading the content of the third storage device (R1) with the sum of the 
content of the first storage device (R2) and the content of the accumulating 
device; and performing extended and non-reduced Montgomery multiplication of 
the content of the second storage device (R0) by 1 to obtain the final reduced 
result in the accumulating device. 
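The alternative above scans the exponent from its least significant bit and uses 
a flag to skip the first multiplication. A minimal software sketch in that spirit 
is given below for illustration only. It rests on several assumptions: the names 
nrmm and modexp_rtl are hypothetical, the loop visits bits e_0 to e_(m-2) and 
tests each bit before squaring, and an exponent with no low bits set is handled 
by skipping the closing multiplication. The exact loop count and shift order of 
the hardware flow are not reproduced. 

```python
def nrmm(a: int, b: int, n_mod: int, s: int) -> int:
    """Non-reduced, extended Montgomery product, congruent to a*b*2^(-s) mod n_mod."""
    acc = 0
    for i in range(s):
        acc += ((a >> i) & 1) * b
        acc += (acc & 1) * n_mod
        acc >>= 1
    return acc


def modexp_rtl(a: int, e: int, n_mod: int) -> int:
    """Right-to-left binary exponentiation with a 'first product' flag (e >= 1)."""
    s = n_mod.bit_length() + 2
    sq = (a << s) % n_mod              # plays the role of R0: running square, starts at A'
    prod = 0                           # plays the role of R3: running product
    flag = 1                           # 1 until the first set bit has been seen
    m = e.bit_length()
    for i in range(m - 1):             # bits e_0 .. e_(m-2) (assumed scan order)
        if (e >> i) & 1:
            if flag:
                prod, flag = sq, 0     # first set bit: copy instead of multiplying
            else:
                prod = nrmm(sq, prod, n_mod, s)
        sq = nrmm(sq, sq, n_mod, s)    # non-reduced Montgomery squaring
    if not flag:                       # fold in the top bit e_(m-1), which is always 1
        sq = nrmm(sq, prod, n_mod, s)
    return nrmm(sq, 1, n_mod, s)       # multiplication by 1 yields the reduced result


print(modexp_rtl(212, 240, 249))       # 241
```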

A modular multiplication of a first (A = A^1*2^n + A^0) and a second 
(B = B^1*2^n + B^0) integer values, where the first integer, the second integer, and the 
modulus (N) are of 2×n bits, can be calculated as follows: 

a) computing the Montgomery multiplication (MMUL(A^0, B^0)) of the n least 
significant bits of the first integer value (A^0) and of the second integer value 
(B^0), by performing the following steps: 

- loading the first (R2), second (R0), third (R1), and fourth (R3) storage 
devices with the n least significant bits (N^0) of the modulus value (N), the n 
least significant bits (B^0) of the second integer value (B), the sum (B^0 + N^0) of 
the n least significant bits of the modulus value (N) and of the n least significant 
bits (B^0) of the second integer value (B), and the n least significant bits (A^0) of 
the first integer value (A), respectively; 

- setting the first (MX1), second (MX2), third (MX3), and fourth (MX4) 
arbitration devices for selecting the input of the circuitry for producing the state 
(K_j) of the second control input (C0), the circuitry for producing the state (K_j) 
of the second control input (C0), the zero value ("0"), and the fourth storage 
device (R3) input, and resetting the content of the accumulating device to zero, 
if it is required; 

- carrying out Montgomery multiplication and obtaining the result (S^(I)) in 
the accumulating device, and the bit states (K_j, 0 ≤ j ≤ n-1) of the second 
control input (K^0) in the fifth register (R4); 

b) computing the value of A^0*B^1 + N^1*K^0 from the n least significant bits of the 
first integer value (A^0), the n most significant bits of the second integer value 
(B^1), the n most significant bits of the modulus value (N^1), the n-bit value (K^0) 
obtained in the fifth register (R4), and the result obtained in step a), by 
performing the following steps: 

- loading the first (R2), second (R0), third (R1), and fourth (R3) storage 
devices with the n most significant bits (N^1) of the modulus value (N), the n 
most significant bits (B^1) of the second integer value (B), the sum (B^1 + N^1) of 
the n most significant bits of the modulus value (N) and of the n most significant 
bits of the second integer value (B), and the n least significant bits (A^0) of the 
first integer value (A), respectively; 

- setting the first (MX1), second (MX2), third (MX3), and fourth (MX4) 
arbitration devices for selecting the input of the fifth register (R4), the least 
significant bit of the accumulating device (S_0), the zero value ("0"), and the 
fourth storage device (R3) input; 

- carrying out regular multiplication and obtaining the most significant 
bits of the result in the accumulating device (S^(II)) and the least significant bits 
of the result in the fifth storage device (R4^(II)); 

c) computing the result of adding the Montgomery multiplication of the n most 
significant bits of the first integer value (A^1) and the n least significant bits of 
the second integer value (B^0) to the result that was previously obtained 
(R4^(II), S^(II)), by performing the following steps: 

- loading the first (R2), second (R0), third (R1), and fourth (R3) storage 
devices with the n least significant bits (N^0) of the modulus value (N), the n 
least significant bits (B^0) of the second integer value (B), the sum (B^0 + N^0) of 
the n least significant bits of the modulus value (N) and of the n least significant 
bits (B^0) of the second integer value (B), and the n most significant bits (A^1) of 
the first integer value (A), respectively; 

- loading the content of the accumulating device (S, also referred to as CSA 
herein) with the n least significant bits of the previously obtained result (R4^(II)), 
and loading the content of the fifth storage device (R4) with the n most significant 
bits of the previously obtained result (S^(II)); 

- setting the first (MX1), second (MX2), third (MX3), and fourth (MX4) 
arbitration devices for selecting the input of the circuitry for producing the state 
(K_j) of the second control input (C0), the circuitry for producing the state (K_j) 
of the second control input (C0), the input from the fifth storage device (R4), and 
the fourth storage device (R3) input; 

- carrying out Montgomery multiplication and obtaining the result (S^(III)) in 
the accumulating device, and the bit states (K_j, 0 ≤ j ≤ n-1) of the second 
control input (K^1) in the fifth register (R4); and 

d) computing A^1*B^1 + N^1*K^1 + S^(III) from the n most significant bits of the first 
integer value (A^1), the n most significant bits of the second integer value (B^1), 
the n most significant bits of the modulus value (N^1), the n-bit value (K^1) 
obtained in the fifth register (R4), and the result obtained in step c) (S^(III)), by 
performing the following steps: 

- loading the first (R2), second (R0), third (R1), and fourth (R3) storage 
devices with the n most significant bits (N^1) of the modulus value (N), the n 
most significant bits (B^1) of the second integer value (B), the sum (B^1 + N^1) of 
the n most significant bits of the modulus value (N) and of the n most significant 
bits of the second integer value (B), and the n most significant bits (A^1) of the 
first integer value (A), respectively; 

- setting the first (MX1), second (MX2), third (MX3), and fourth (MX4) 
arbitration devices for selecting the input of the fifth register (R4), the least 
significant bit of the accumulating device (S_0), the zero value ("0"), and the 
fourth storage device (R3) input; and 

- carrying out Montgomery multiplication and obtaining the most 
significant bits of the result in the accumulating device (S^(IV)) and the least 
significant bits of the result in the fifth storage device (R4^(IV)). 
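The decomposition above splits each 2×n-bit operand into n-bit halves and 
alternates a Montgomery step on the low halves with a regular multiply and 
accumulate on the high halves. For illustration only, a word-level sketch of 
that decomposition follows. It is not the bit-serial register flow described 
above: the function name mont_2n is hypothetical, the quotient value K is 
obtained from a modular inverse rather than bit by bit, and Python 3.8 or 
later is assumed for pow with a negative exponent. 

```python
def mont_2n(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Montgomery product of 2n-bit operands via n-bit halves.

    Returns a value congruent to a*b*2^(-2*n_bits) mod n_mod (n_mod odd).
    """
    w = 1 << n_bits
    mask = w - 1
    b0, b1 = b & mask, b >> n_bits                 # B = B1*2^n + B0
    n0, n1 = n_mod & mask, n_mod >> n_bits         # N = N1*2^n + N0
    neg_n0_inv = (-pow(n0, -1, w)) % w             # -(N0)^(-1) mod 2^n
    t = 0
    for a_i in (a & mask, a >> n_bits):            # A0 first, then A1
        low = (t & mask) + a_i * b0                # low-half partial sum
        k = ((low & mask) * neg_n0_inv) & mask     # n-bit value K clearing n low bits
        carry = (low + k * n0) >> n_bits           # Montgomery step on the low halves
        t = (t >> n_bits) + a_i * b1 + k * n1 + carry   # regular step on the high halves
    return t


n_mod, a, b = 0xF123, 0x1234, 0xBEEF               # illustrative 16-bit values (n = 8)
t = mont_2n(a, b, n_mod, 8)
print((t << 16) % n_mod == (a * b) % n_mod)        # True
```

The final check confirms that multiplying the result by 2^(2n) recovers A*B 
modulo N, which is the defining property of a Montgomery product. 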

The method may further comprise carrying out modular multiplication of a first 
(A = Σ_i A^i*2^(i*n)) and a second (B = Σ_i B^i*2^(i*n)) integer values, where the first 
integer, the second integer, and the modulus (N = Σ_i N^i*2^(i*n)) may be of more than 
2×n bits, where the computation is carried out by computing intermediate 
results of the multiplication of 2×n bits subsequent fractions of the first integer 
and second integer. 

In another aspect the present invention is directed to an apparatus for carrying 
out extended and non-reduced Montgomery multiplication of a first (A) and 
second (B) integer values, in which the number of iterations (s) required is 
greater than the number of bits (n) in the modulo value (N), and in which the 
Montgomery multiplication result is smaller than twice the modulo value 
(2×N), comprising: 

a first storage device (R2) for storing the modulo value (N); 

a second storage device (R0) for storing the value of the first integer 
value (A); 

a third storage device (R1) for storing the sum of the first integer value 
and the modulo (A+N); 

an arbitration circuitry having a first (In1), second (In2), and third (In3) 
inputs from the first (R2), second (R0), and third (R1) storage devices, and 
having a fourth input which is zero ("0"); the arbitration device receives a first 
(C1) and a second (C0) control inputs, and thereby is capable of selecting one of 
its other inputs as its output, which is attached to the input of the accumulating 
device; 

circuitry for producing the state (K_j) of the second control input (C0) 
according to the state of a selected bit of the first integer value (A_j), the state of 
the least significant bit of the second integer value (B_0), and according to the 
state of the least significant bit of the accumulating device (S_0); and 

an accumulating device (S) capable of storing n+2 bits values, of adding 
n+2-bits values (X) to its content (S + X -> S), and of dividing its content by 2 
(S/2 -> S). 

Preferably, the circuitry utilized for producing the state (K_j) of the second 
control input comprises: 

Circuitry for producing a value of one whenever: 

the state of the selected bit (A_j) and the state of the least significant bit 
of the second integer value (B_0) are one, and the state of the least significant bit 
of the accumulating device (S_0) is zero; or 

the state of the selected bit (A_j) and the state of the least significant bit 
(B_0) of the second integer value are in different states, and the state of the least 
significant bit (S_0) of the accumulating device is one; 
the circuitry produces a zero value in all other cases. 

The first (R2), second (R0), and third (R1) storage devices can be n+2 bits shift 
registers having a serial input into their most significant bit locations, and 
which may be capable of outputting their content in parallel. The first storage 
device (R2) may also have a serial output, from its least significant bit location 
(R2_0), allowing it to perform cyclic bit rotation. 

The apparatus may further comprise means for allowing modular arithmetic 
operations to be carried out, comprising: 

means for connecting the serial input of the second storage device (R0) to 
the least significant bit (S_0) of the accumulating device (S); 

a fourth storage device (R3) capable of serially outputting its content, bit 
by bit in sequence (R3_I, I = 0, 1, 2, ..., n+1), starting from its least significant bit 
(R3_0); the fourth storage device is capable of storing n+2 bits, and of performing 
cyclic bit rotation of its content; 

a fifth storage device (R4) having a serial input and a serial output, and 
which is capable of storing values of n+2 bits; 

a sixth storage device (R5) capable of serially outputting its content, bit by 
bit in sequence (R5_I, I = 0, 1, 2, ..., n+1), starting from its least significant bit; the 
sixth storage device is capable of storing n+2 bits; 

a first arbitration device (MX1) having a first input from the fifth storage 
device (R4_I), and a second input from the circuitry producing the state of the 
second control input (K_j); the output of the first arbitration device is attached to 
the second control input (C0); 

a second arbitration device (MX2) having a first input being equal to the 
least significant bit of the accumulating device (S_0), a second input received 
from the output of the circuitry (K_j), and a third input connected to the serial 
output (R4_I) of the fifth storage device (R4); the output of the second arbitration 
device is attached to the serial input of the fifth storage device (R4); 

a third arbitration device (MX3) having a first input which is constantly 
fed with a zero value ("0"), and a second input received from the serial output of 
the fifth storage device (R4_I); the output of the third arbitration device is 
connected to a serial input of the accumulating device; 

a fourth arbitration device (MX4) having a first input connected to the 
serial output of the sixth storage device (R5_I), and a second input connected to 
the serial output of the fourth storage device (R3_I); the output of the fourth 
arbitration device is connected to the first control input (C1); and 

an adder capable of performing serial addition of n+2 bit values; the 
adder receives a first input from the least significant bit location of the 
accumulating device (S_0), and a second input from the serial output of the first 
storage device (R2); the output of the adder is connected to the serial input of 
the third storage device (R1). 

The accumulating device may consist of n+2 addition and latching stages, each 
of which consists of a first and a second flip-flop device and a full adder device 
having three inputs, except for the first stage wherein the second flip-flop is 
excluded, comprising: 

a) means for connecting the first input of the full adder to the output of a 
first flip-flop device; 

b) means for connecting the second input of the full adder to the output of a 
second flip-flop device of the subsequent addition and latching stage; and 

c) means for connecting the third input of the full adder to the respective bit 
output of the arbitration device (MUX_i, 0 ≤ i ≤ n+1). 

The accumulating device may further comprise means for adding the output 
from the third arbitration device (MX3), via the serial input of the accumulating 
device, to the addition result of the (n+1)-th addition and latching stage, 
comprising: 

a) a first and a second half adder device, and a third flip-flop device; 

b) means for connecting the input of the first flip-flop device to the sum 
output of the second half adder; 

c) means for connecting the input of the second flip-flop device to the carry 
output of the second half adder, and for connecting the output of the flip- 
flop device to the second input of the full adder of the (n+2)-th addition 
and latching stage; 

d) means for connecting the first input of the second half adder to the carry 
output of the full adder of the (n+1)-th addition and latching stage, and its 
second input to the carry output of the first half adder; 

e) means for connecting the first input of the first half adder to the sum 
output of the full adder, and for connecting the second input of the second 
half adder to the output of the third arbitration device (MX3); and 

f) means for connecting the input of the third flip-flop device to the sum 
output of the first half adder, and connecting its output to the second input 
of the full adder of the (n-1)-th addition and latching stage. 

The state of the second control input (C0) can be determined utilizing the 
least significant bit of the second storage device (R0), the output of the fourth 
arbitration device (MX4), the carry output of the full adder of the first addition 
and latching stage, and the sum output of the full adder of the second addition 
and latching stage, comprising: 

a) means for connecting the least significant bit of the second storage device 
(R0) and the output of the fourth arbitration device (MX4) to the inputs 
of an AND logical gate; 

b) an additional half adder and an additional flip-flop device; 

c) means for connecting the first input of the half adder to the sum output of 
the full adder of the second addition and latching stage, and its second 
input to the carry output of the full adder of the first addition and 
latching stage; 

d) means for connecting the sum output of the half adder to the input of the 
additional flip-flop device; and 

e) means for connecting the output of the AND logical gate and the output of 
the flip-flop device to the inputs of a XOR gate, and utilizing the output of 
the XOR gate to determine the state of the second control input (C0). 
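For illustration only, the AND/XOR arrangement of items a) to e) can be modelled 
in software as a single Boolean expression. The sketch assumes that the output 
of the additional flip-flop device represents the current least significant bit of 
the accumulated value; the function name k_bit and its parameters are 
hypothetical. Under that assumption the XOR output equals the parity of the 
accumulated value after the selected multiple of B has been added. 

```python
def k_bit(a_j: int, b0: int, s0: int) -> int:
    """Model of the control bit: (A_j AND B_0) XOR S_0."""
    return (a_j & b0) ^ s0


# Exhaustive comparison against plain integer arithmetic for small values.
matches = all(
    k_bit(a_j, b & 1, s_val & 1) == ((s_val + a_j * b) & 1)
    for s_val in range(16)
    for b in range(16)
    for a_j in (0, 1)
)
print(matches)  # True
```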

Brief Description of the Drawings 
In the drawings: 

Fig. 1 is a block diagram schematically illustrating a prior art apparatus 
for carrying out multiplication and addition operations; 
Fig. 2 is a block diagram schematically illustrating a preferred 
embodiment of the invention for computing a non-reduced and extended 
Montgomery multiplication; 

Fig. 3 schematically illustrates one preferred embodiment of the invention 
for generating the K_j bit; 

Fig. 4 is a block diagram schematically illustrating a preferred 
embodiment of the invention for carrying out modular arithmetic 
operations, utilizing Montgomery multiplication; 

Fig. 5 schematically illustrates a process for computing interleaved 
Montgomery multiplication, according to a preferred embodiment of the 
invention; 

Figs. 6A and 6B schematically illustrate a possible embodiment of a CSA 
device according to the method of the invention; and 

Figs. 7A and 7B are flowcharts illustrating methods for carrying out 
exponentiation by utilizing the PKI apparatus. 

Detailed Description of Preferred Embodiments 

The present invention refers to a method and apparatus for carrying out 
modular arithmetic operations, which is fast and efficient in terms of hardware 
means. At the core of the preferred embodiment of the invention is the 
computation of the modular multiplication of two integers A and B modulo N 
(hereinafter A*B mod N), based on a modified (extended) Montgomery method. 

A modified (extended) Montgomery multiplication - definition: For an n bits long 
odd modulus N, integers A, B such that A, B ≤ 2*N, and an integer s > n, define 
the Non-Reduced and extended Montgomery Multiplication (NRMM) by 
NRMM^(s)(A,B,N) = A*B*2^(-s) mod (N + ε*N), where ε = 0 for a reduced result, 
and ε = 1 for a non-reduced result. For short, when the context (i.e., N and s) is 
known, NRMM^(s)(A,B) will be used hereinafter to denote NRMM^(s)(A,B,N). The 
computation of NRMM^(s)(A,B) is carried out by repeating steps 1.1, 1.2, and 1.3 
for s (>n) iterations, without performing the reduction step 1.4. Hereinafter the 
result of such computation is also termed non-reduced and extended 
Montgomery multiplication. It is important to note that the result obtained by 
this non-reduced and extended Montgomery multiplication is not necessarily 
reduced (i.e., NRMM^(s)(A,B,N) may be greater than the modulus N). 

A process for computing NRMM^(s)(A,B) is given by the following steps: 






Process 3: 

Input: A, B, N, s, n (Precondition: N is an n-bit integer with A, B < 2*N, N is 
odd, and s ≥ n) 
Output: NRMM^(s)(A,B) 
S = 0 

For I from 0 to s-1 do 

3.1. S = S + A_I*B 

3.2. S = S + S_0*N 

3.3. S = S/2 
End for 
Return S 
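For illustration only, steps 3.1 to 3.3 can be transcribed directly into software. 
The sketch below uses a hypothetical function name, nrmm, and models the 
accumulator S as an unbounded integer rather than as an (n+2)-bit hardware 
device. 

```python
def nrmm(a: int, b: int, n_mod: int, s: int) -> int:
    """Non-reduced, extended Montgomery product, congruent to a*b*2^(-s) mod n_mod."""
    if n_mod % 2 == 0:
        raise ValueError("the modulus N must be odd")
    acc = 0                              # the accumulator S
    for i in range(s):
        acc += ((a >> i) & 1) * b        # step 3.1: S = S + A_I*B
        acc += (acc & 1) * n_mod         # step 3.2: S = S + S_0*N
        acc >>= 1                        # step 3.3: S = S/2 (S is even at this point)
    return acc


n_mod = 249                              # 8-bit odd modulus, as in Example 3
s = n_mod.bit_length() + 2               # s = n + 2 = 10
print(nrmm(209, 209, n_mod, s))          # 235
print(nrmm(235, 209, n_mod, s))          # 269, larger than N but below 2*N
```

The two printed values correspond to the first squaring and the first 
multiplication of Table 3. 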

The special case where A, B < N and s=n is the classical Montgomery 
multiplication which is used in most applications where the final reduction step 
is ignored. According to the method of the invention this process is performed 
without performing reduction (step 1.4), and in a preferred embodiment of the 
invention, s=n+2 is utilized, wherein for inputs bounded by 2*N, the result 
obtained is also bounded by 2*N, although it is sufficient to require that 
B < 2*N and that A is not of more than n+1 bits. 

The method of the present invention is based on the following facts: when 
performing s=n+2 iterations, with an n bits long modulus N and (n+1) bits long input 
values A and B (where A, B < 2*N), the final result of NRMM^(s)(A,B) does not 
exceed 2*N, and the temporary accumulated results (step 3.2) do not exceed 
6*N. This observation is of significant importance, since it allows for successive 
applications of this extended and non-reduced Montgomery Multiplication, in 
which the input and the output values are bounded by the same upper bound 
(2*N), thus eliminating potential overflows. As explained before, the 
exponentiation process A^E mod N can be implemented by means of a sequence 
of Montgomery multiplications and Montgomery squarings. A MMUL(A,A) 
operation with an n bits long operand A (A<N) may produce a non-reduced 
result larger than N but smaller than 2*N. Thus, non-reduced Montgomery 
Multiplication with s=n+2 rounds allows performing a continuous 
exponentiation sequence of NRMM^(s) operations without a need for reduction in the 
intermediate steps, with storage registers of length (n+2) bits and an accumulator 
capable of computing up to (n+3) bits results. As will be explained hereinafter, 
an implementation of an (n+2) bits accumulator (CSA) may be utilized according to 
the method of the invention. Moreover, s=n+2 is the minimal number of rounds 
that guarantees such exponentiation without reduction. 

The computation of the non-reduced extended Montgomery multiplication is 
implicitly based on adding the value K*N (for some K ≥ 0) to the product A*B. 
The value of K is not known in advance, and is constructed iteratively. In the 
preferred embodiment of the invention, in each iteration of the process, another 
bit K_j of the integer K is computed, as will be described hereinafter. The 
modulus value N may be added to the product A*B any number of times, and 
the sum could still be considered as the same result modulo N; that is, the result after 
adding K*N yields the same residue modulo N if it is reduced to the range [0, N). 
The value of K is chosen in a way that A*B + K*N is divisible by 2^s. The result 
A*B + K*N is divided by 2^s (shifted to the right s times), for disposing of s 
zeros from the result's LSBs. Thus, the result is actually the outcome of s 
successive Right Shift (RSH^s) operations, RSH^s(A*B + K*N) = (A*B + K*N)/2^s, 
wherein RSH^s(X) = X*2^(-s) denotes s shifts of X to the right. These shifts are 
performed in each iteration (step 3.3). 
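For illustration only, the relation between the iteratively constructed value K and 
the final result can be checked in software. In the sketch below the function 
name nrmm_with_k is hypothetical; it records the bit added in step 3.2 of each 
iteration as bit K_j of K, and the closing check confirms that 
A*B + K*N equals the result shifted left by s positions. 

```python
def nrmm_with_k(a: int, b: int, n_mod: int, s: int):
    """Return (S, K) where S is the non-reduced result and K collects the bits K_j."""
    acc, k = 0, 0
    for j in range(s):
        acc += ((a >> j) & 1) * b      # add the selected multiple of B
        k_j = acc & 1                  # K_j: parity of the accumulator
        acc += k_j * n_mod             # add N exactly when the accumulator is odd
        k |= k_j << j                  # K = sum of K_j * 2^j
        acc >>= 1
    return acc, k


a, b, n_mod, s = 235, 209, 249, 10
result, k = nrmm_with_k(a, b, n_mod, s)
print(result, k)                                   # 269 909
print(a * b + k * n_mod == result << s)            # True
```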

The NRMM^(s) performed according to the method of the invention consists of 
s=n+2 iterations, in which a value is added to an accumulated result. The value 
that is added to the accumulated result, in each iteration, is chosen such that 
the temporary cumulative addition result of step 3.2 is an even number. 
Therefore, the LSB bit of the temporary value of the cumulative result is always 
zero, and it can be divided by 2 (step 3.3) by means of one right shift. 

More particularly, whenever the computation result of S = S + A_j*B is an odd 
value, the (odd) modulus N is added to S. Thus, in each iteration the following 
calculation is performed: 

\[
S \leftarrow
\begin{cases}
S + A_j \cdot B & \text{if } S + A_j \cdot B \text{ is even} \\
S + A_j \cdot B + N & \text{if } S + A_j \cdot B \text{ is odd}
\end{cases}
\]

Therefore, the result may always be divided by 2 without a remainder (i.e., by a 
right shift). 

According to a preferred embodiment of the invention, a modification of the 
classical Montgomery multiplication method is utilized to facilitate 
implementations for modular arithmetic computations, which can be realized 
completely by hardware. In prior art methods for computing the classical 
Montgomery multiplication, the computation of MMUL(A,B) = A*B*2^(-n) mod N 
is obtained in a process of n iterations, wherein n is the number of bits in the 
modulus N. There is a substantial advantage in performing more than n 
iterations in this computation, as previously discussed. In a preferred 
embodiment of the invention, s=n+2 is utilized, and the following arguments 
hold for this type of Montgomery multiplication: 

When performing s=n+2 iterations to compute NRMM^(s)(A,B), with n bits long 
input values A and B (A, B < N), and with an n bits long modulus N, all the bits of 
A are scanned, the final result does not exceed N+B < 2*N, and the temporary 
accumulated results do not exceed 2*(N+B) < 4*N. 

Moreover, when performing s=n+2 iterations to compute the non-reduced and 
extended NRMM^(s)(A,B), with (n+1) bits long input values A and B (where A, B 
< 2*N), and with an n bits long modulus N, all the bits of A are scanned, the final 
result does not exceed (N+B+N)/2 < 2*N, and the temporary accumulated 
results do not exceed 2*(N+B) < 6*N. 

It is important to note that when performing s=n+2 iterations to compute 
NRMM^(s)(A,1) with an (n+1) bits long input value A (A < 2*N), and with an n bits long 
modulus N, all the bits of A are scanned, and the final result obtained is 
reduced, i.e., is smaller than N. 
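For illustration only, the bounds stated above for s=n+2 can be checked 
exhaustively for a few small odd moduli. The function names nrmm_trace and 
check_bounds are hypothetical, and the selection of moduli is arbitrary. The 
sketch records the largest value held after step 3.2 and the final result for 
every pair of inputs A, B < 2*N. 

```python
def nrmm_trace(a: int, b: int, n_mod: int, s: int):
    """Return (final result, largest accumulator value seen after step 3.2)."""
    acc, peak = 0, 0
    for i in range(s):
        acc += ((a >> i) & 1) * b
        acc += (acc & 1) * n_mod
        peak = max(peak, acc)          # temporary accumulated result, before halving
        acc >>= 1
    return acc, peak


def check_bounds(n_mod: int) -> bool:
    """True when every A, B < 2*N gives a result < 2*N and a peak < 6*N."""
    s = n_mod.bit_length() + 2         # s = n + 2
    for a in range(2 * n_mod):
        for b in range(2 * n_mod):
            result, peak = nrmm_trace(a, b, n_mod, s)
            if result >= 2 * n_mod or peak >= 6 * n_mod:
                return False
    return True


print(all(check_bounds(n) for n in (3, 5, 7, 11, 13, 15)))  # True
```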

As a result, when a chained sequence of non-reduced Montgomery 
multiplications is performed, with an n bits long modulus N, and inputs that are 
bounded by 2*N, the outputs remain bounded by 2*N, and one (final) extended 
Montgomery multiplication by 1 reduces the result to the range [0, N) (without 
actually performing the reduction of step 1.4). 
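For illustration only, a chained sequence followed by a final multiplication by 1 
can be simulated as below. The function name nrmm, the chain length of 20, 
and the operand A = 212 (which is coprime to N = 249) are assumptions made 
for the sketch. 

```python
def nrmm(a: int, b: int, n_mod: int, s: int) -> int:
    """Non-reduced, extended Montgomery product, congruent to a*b*2^(-s) mod n_mod."""
    acc = 0
    for i in range(s):
        acc += ((a >> i) & 1) * b
        acc += (acc & 1) * n_mod
        acc >>= 1
    return acc


n_mod = 249
s = n_mod.bit_length() + 2                 # s = n + 2 = 10
a_adj = (212 << s) % n_mod                 # adjusted operand A' = 209

x = a_adj
chain_bounded = True
for _ in range(20):                        # chain of products, no reduction in between
    x = nrmm(x, a_adj, n_mod, s)
    chain_bounded = chain_bounded and x < 2 * n_mod

final = nrmm(x, 1, n_mod, s)               # one extended multiplication by 1
print(chain_bounded, final < n_mod)        # True True
print(final == pow(212, 21, n_mod))        # True
```

The chain multiplies 21 copies of A' together, so the final value equals 
A^21 mod N after the multiplication by 1 removes the remaining factor 2^s. 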


The latter observations are of significant importance in applications. As 
explained before, the exponentiation process A^E mod N (A<N) can be 
implemented by means of a sequence of Montgomery multiplication and 
Montgomery squaring (MMUL(X,A), MMUL(X,X)) operations that, even with an 
n bits long operand X (X<N), and certainly with an n+1 bits operand X < 2*N, 
may produce a non-reduced result larger than N but smaller than 2*N. The 
modified Montgomery Multiplication (non-reduced) with s=n+2 rounds allows 
performing a continuous exponentiation sequence of NRMM^(s) operations without a need 
for reduction in the intermediate steps, with storage registers of length (n+2) 
bits and an accumulator of length (n+3) bits (i.e., an (n+2) bits long accumulator 
that includes one additional bit for a carry). Moreover, s=n+2 is the minimal 
number of rounds that guarantees such exponentiation without reduction. 

Example 3: in the following example the modified Montgomery Multiplication 
is utilized for calculating the exponent A^E mod N, for A = 212, 
E = 240 = (11110000)_2 (m = 8), and N = 249 (n = 8, as in Example 2). The modified 
Montgomery multiplication is carried out by performing s = n + 2 = 10 iterations, 
and thus the precalculation of A' = 212 * 2^10 mod 249 = 209 is required. 

TABLE 3: (Precondition: A = 212, E = 240 = (11110000)_2, N = 249, and 
T^(7) = A' = 209) 

  I    E_I    T^(I+1)    Step 2.1 result    Step 2.2 result, T^(I)
  6    1      209        235                269
  5    1      269        121                254
  4    1      254        241                296
  3    0      296        319                319
  2    0      319        175                175
  1    0      175        160                160
  0    0      160        25                 25

In Table 3, the value obtained in the preceding step, T^(I+1), is followed by 
the result obtained in step 2.1, (T^(I+1))^2, and the result obtained in step 
2.2, T^(I). The final result is obtained by computing 
T^(0) = NRMM^(s)(T^(0),1) = 241. As shown, the results of the intermediate 
Montgomery multiplications that were performed were not reduced. In the 
operation of step 2.2 performed in iterations I=6, 5, 4, and 3, the results 
were NRMM^(s)(T^(I),A') > N, and for the operation of step 2.1 in the 
iteration I=3 the result was NRMM^(s)(T^(I+1),T^(I+1)) > N. As discussed 
before, the non-reduced Montgomery multiplications are bounded, and do not 
exceed 2*N. Table 4 exemplifies the benefits of the modified Montgomery 
Multiplication, for the calculation of NRMM^(s)(319,319), as performed in 
step 2.1 of iteration I=2 in Table 3 hereinabove. 

TABLE 4: (Precondition: S = 0, A = 319 = (100111111)_2, B = 319, and N = 249) 

  I    A_I    S = S + A_I*B    S_0    S = S + S_0*N    S = S/2
  0    1      319              1      568              284
  1    1      603              1      852              426
  2    1      745              1      994              497
  3    1      816              0      816              408
  4    1      727              1      976              488
  5    1      807              1      1056             528
  6    0      528              0      528              264
  7    0      264              0      264              132
  8    1      451              1      700              350
  9    0      350              0      350              175

The result obtained is 319*319*2^(-10) mod 249 = 175, and evidently all the 
temporary accumulated results are bounded by 6*N. It should be noted that for 
I=5 a temporary result of S = S + S_0*N = 1056 = (10000100000)_2 is obtained, 
which is of 11 bits (n+3). In fact, this is the maximal bit length that is 
required for such calculations utilizing the non-reduced Montgomery 
Multiplication, and therefore the CSA should be capable of computing results 
that are up to n+3 bits. However, due to the continuous right shifts that are 
performed in the CSA in each operation, it is implemented as an n+2 bit CSA. 
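
The rows of Table 4 can be regenerated with a few lines of Python. This is a 
sketch for checking the arithmetic only; the function name nrmm_trace and the 
tuple layout are assumptions introduced here. 

```python
def nrmm_trace(a, b, n_mod, s):
    """Return rows (I, A_I, S + A_I*B, S_0, S + S_0*N, S/2) of one NRMM run."""
    rows, acc = [], 0
    for i in range(s):
        a_i = (a >> i) & 1
        after_b = acc + a_i * b           # S = S + A_I * B
        s0 = after_b & 1                  # LSB decides whether N is added
        after_n = after_b + s0 * n_mod    # S = S + S_0 * N (now even)
        acc = after_n >> 1                # S = S / 2
        rows.append((i, a_i, after_b, s0, after_n, acc))
    return rows


rows = nrmm_trace(319, 319, 249, 10)
peak = max(row[4] for row in rows)        # largest temporary value
for row in rows:
    print(*row, sep="\t")
print("peak:", peak, "bits:", peak.bit_length())
```

The peak value 1056 occupies 11 bits, which is the n+3 bit case discussed 
above for n = 8. 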

The K_I bit takes the value S_0, the LSB of the partial result S = S + A_I*B, 
which is realized in each iteration. This value (K_I) is completely determined 
by the least significant bits of the results of the previous iteration, and 
other known values, and can be realized by K_I = (A_I · B_0) ⊕ CSA'_I, where 
CSA'_I (603) is an output obtained from the CSA. As will be explained in detail 
with reference to Fig. 6, with some additional hardware the CSA can provide 
the CSA'_I (603) output, which is used to speed up the process of producing 
the K_I bit. This realization can be easily implemented in hardware. An 
apparatus based on the determination of K_I, according to a preferred 
embodiment of the invention, is illustrated in Fig. 2. An additional shift 
register, R3, is used in this apparatus for feeding the A_I bits of A. The R3 
register has a serial output, and it consists of s bits for holding the value 
of A in its n LSBs, and two additional (zero) bits in its 2 leftmost MSB 
locations, which are utilized for carrying out the two additional iterations 
(s=n+2). The CSA, which is of n+2 bits, acts as an additional storage device, 
and thus there is no need for an additional storage device for partial results 
that are obtained in intermediate steps. 

In the preferred embodiment of the invention, the value of K_I is realized from 
the values of A_I, R0_0, and CSA'_I (603). With reference to Fig. 2, the value 
of K_I is realized utilizing appropriate circuitry 602 (for which a possible 
implementation is illustrated in Fig. 3), which receives A_I, R0_0, and CSA'_I 
as inputs. The bit B_0 is placed in a latching device 200, which receives the 
LSB of register R0 (R0_0). To carry out the calculation of NRMM^(s)(A,B), the 
system is initialized by loading the values B, B+N, N, and A into the 
respective registers, R0, R1, R2, and R3, and by zeroing the content of the 
CSA. Thus K_0 will equal "1" only if A_0 = B_0 = 1. 

It should be understood that when Montgomery Multiplication is performed, and 
N is odd, the content of the CSA is always even, which enables the division by 2 
to be carried out by means of one right shift, without a remainder. In addition, 
the LSB of the CSA is obtained on the CSA_0 output, and hence, in case there is a 
remainder (regular multiplication), it is obtained on the CSA_0 output. 

Fig. 3 demonstrates one possible implementation of a circuitry 602 for providing 
the K_I bit. The realization in Fig. 3 is carried out utilizing an AND gate 300 
and an Exclusive Or (XOR) gate 301, wherein the inputs of the AND gate are the 
bits A_I and B_0, and the XOR gate inputs are the output of the AND gate 300 
and CSA'_I 603. The CSA'_I 603 output from the CSA produces an expected value 
for the CSA LSB, and therefore speeds and simplifies the realization of the 
K_I bit. 
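
The gate-level rule can be compared with the arithmetic definition of the K 
bit in a short Python sketch. The function name k_bit and the argument csa_lsb 
(standing in for the 603 output, i.e. the accumulator LSB before the addition) 
are assumptions introduced for this illustration. 

```python
def k_bit(a_i, b_0, csa_lsb):
    """K_I = (A_I AND B_0) XOR (LSB of the accumulator before the addition)."""
    return (a_i & b_0) ^ csa_lsb


# Exhaustive comparison with the arithmetic definition K_I = LSB(S + A_I * B).
matches = all(
    k_bit(a_i, b & 1, acc & 1) == ((acc + a_i * b) & 1)
    for a_i in (0, 1)
    for b in range(16)
    for acc in range(16)
)
print(matches)
```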

The method of the invention, as described and exemplified hereinabove, is 
utilized for a fast and efficient computation of the extended and non-reduced 
Montgomery multiplication NRMM^(s)(A,B), wherein A and B are smaller than 
2*N, and N is up to n bits (and s ≥ n + 2). This apparatus can be modified to 
allow modular product computation of integers which have more than n bits, 
which is also known as the Montgomery interleaved modular multiplication, as 
will be discussed later. 

Fig. 4 depicts an apparatus, according to a preferred embodiment of the 
invention, for carrying out arithmetic operations based on the extended non- 
reduced Montgomery modular multiplication. The apparatus, also termed Public 
Key Interface (PKI) herein, is based on 6 registers (each of n+2 bits), R0, R1, 
R2, R3, R4, R5, and a Carry Save Adder (of n+2 bits), CSA, with some control 
(not shown). The PKI apparatus is capable of performing various arithmetic and 
modular arithmetic operations, as will be explained hereinbelow. 

In the apparatus of Fig. 4, the additional multiplexers, MX1, MX2, MX3 and 
MX4, and the shift registers R4 and R5, are introduced. The control input C1 of 
the MUX is connected to the output of MX4, which acts as an arbitrator for 
selecting between the serial outputs of registers R3 and R5. Registers R2, R3 
and R4 have serial inputs and serial outputs, and are capable of performing 
cyclic bit rotation. The other MUX control input, C0, is connected to the output 
of MX1, which acts as an arbitrator to select the input value from register R4, 
or from the circuitry that produces the value K_I. The register R4 has a serial 
input, 
which is connected to the output of MX2, which acts as an arbitrator for 
selecting between the input of the CSA_0 value, the output of R4 (useful when 
cyclic bit rotation of R4 is performed), or the value of K_I (602). 

The third multiplexer, MX3, selects the input to the CSA serial input, and may 
select a "0" value or the output of MX4. The output of MX3 is added to the n-th 
bit of the CSA, so that in each step the CSA content is set by performing the 
calculation of CSA^(I+1) = (CSA^(I) + out^(I) + MX3^(I) * 2^n)/2 (where out^(I) 
and MX3^(I) are the outputs from the MUX and MX3 devices respectively), as will 
be discussed herein. It should be noted that register R5 is utilized only for 
carrying out squaring operations which are involved in more complex arithmetic 
computations (i.e., exponentiation). It will be shown that for performing a 
squaring operation register R5 is loaded with the content of register R0. 
Therefore, one may implement the same apparatus without register R5, and 
read the subsequent bits of register R0 utilizing multiplexing techniques. A 
possible embodiment of the CSA is illustrated in Figs. 6A and 6B. 

The CSA illustrated in Figs. 6A and 6B is based on a serial approach, wherein a 
set of n Full Adders (FA) are serially connected. The CSA 600 depicted in Fig. 
6A is an n bits CSA, in which each FA has 3 inputs and 2 outputs, a Carry (C) 
and a Sum (S), each of which is the input of a Flip-Flop (FF) device. Each FA 
receives the following inputs: the output of the FF which receives the S output 
of the subsequent FA; the output of the FF which receives its own C output; and 
a corresponding input from the MUX (MUX_(n-1), MUX_(n-2), ..., MUX_0). In this 
way, the right-shift of the CSA content, and the addition of the MUX output, 
out, are effected. The leftmost FA device 610 receives an input from another 
two stages, 611 and 612, depicted in Fig. 6B. 

The additional stages, 611 and 612, depicted in Fig. 6B are utilized to expand 
the n bit CSA 600 of Fig. 6A into an (n+2) bit CSA. The n-th stage 611 in Fig. 
6B is utilized for the addition of MX3^(I) * 2^n to the CSA content. Although it 
is shown that the addition of 4 bits is performed by the n-th stage 611, it 
should be understood that in practice only 3 bits are summed by this stage. 
More particularly, when performing the Montgomery based computations, the input 
received from MX3 is always in zero state, and when performing regular 
multiplications, which are part of an interleaved multiplication, the input 
received from the (n+1)-th stage 612 is in zero state. 

To accelerate the system performance, the C output 604 of the first stage FA, 
and the S output 605 of the second stage FA, are connected to the Half Adder 
(HA) 607, whose S output is connected to a FF from which the output CSA'_I 
603 is provided for the circuitry utilized for determining K_I. The HA 607 may 
be replaced by a logical XOR gate, or any device capable of realizing the ⊕ 
operation (i.e., addition modulo 2). It should be also noted that the serial 
output of the CSA, CSA_0, is not provided via an FF device, but instead it is 
obtained directly from the S output of the first stage's FA. 
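
The shift-and-add behaviour of the serially connected full adders can be 
modelled on whole words. The sketch below is a simplified illustration and not 
a description of the circuit of Figs. 6A and 6B: the names csa_step and value 
are assumptions, and the extra stages 611 and 612 and the look-ahead output 
603 are not modelled. 

```python
def csa_step(sum_bits, carry_bits, mux_out):
    """One clock of a shift-and-add carry-save stage, modelled on whole words.

    Bit i of the result combines sum bit i+1 (the right shift), carry bit i
    and bit i of the multiplexer output, as a single full adder would.
    """
    shifted = sum_bits >> 1
    new_sum = shifted ^ carry_bits ^ mux_out
    new_carry = ((shifted & carry_bits)
                 | (shifted & mux_out)
                 | (carry_bits & mux_out))
    return new_sum, new_carry


def value(sum_bits, carry_bits):
    """Integer represented by a (sum, carry) pair; carries weigh twice as much."""
    return sum_bits + 2 * carry_bits


s_bits, c_bits, m = 0b1011010, 0b0010011, 0b1100101
new_s, new_c = csa_step(s_bits, c_bits, m)
print(value(new_s, new_c) == (s_bits >> 1) + c_bits + m)
```

No carry propagates along the word in csa_step; each bit position is computed 
independently, which is the property that makes the carry-save form fast. 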

The application of various arithmetic operations, according to a preferred 
embodiment of the invention, is described in the following discussion. While this 
is a limited set of operations, it does not limit the application of a wider set 
comprising other possible operations, utilizing the method of the invention, and 
is therefore introduced here only for the purpose of illustration. 

Montgomery square (NRSQR^(s)) 

The following process is utilized for the computation of 
CSA = (B*B + K*N + CSA)/2^s, and therefore provides the Non-Reduced and 
Extended Montgomery Squaring of an integer value B, NRMM^(s)(B,B). The 
number of rounds is s>n; however, it is shown that the optimal choice is 
s = n + 2. 

Input: B, N, s (B -> R0, B + N -> R1, N -> R2) 
Output: NRSQR^(s)(B) = NRMM^(s)(B,B) 

R0 -> R5 
For I from 0 to s-1 do 
    K_I = LSB(CSA + R5_I * R0_0) 
    CSA = (CSA + X)/2, where X = 0   if R5_I = 0 and K_I = 0 
                                 R0  if R5_I = 1 and K_I = 0 
                                 R2  if R5_I = 0 and K_I = 1 
                                 R1  if R5_I = 1 and K_I = 1 
End for 
Return CSA 



For this calculation, the control inputs of MX1, MX2, MX3, and MX4 are set to 
select the inputs of K_I, K_I, "0", and R5 respectively. It should be noted that 
for this computation the input selection made for MX2 does not affect the 
result. When this operation is performed as part of an interleaved 
multiplication the control input of MX3 is set to select the R4 input. After 
performing s iterations, the value of K is obtained in the R4 register. The 
content of R5 may be loaded (Fig. 5) with the content of register R0, utilizing 
conventional parallel/serial techniques (not illustrated) or by software. It 
should be understood that the NRSQR^(s) process may be utilized to compute 
(B*B + K*N + CSA)/2^s, or (B*B + K*N)/2^s by zeroing the content of the CSA in 
the initialization steps. 

Non-reduced and extended Montgomery multiplication (NRMM^(s)) 

The non-reduced Montgomery multiplication implemented by the PKI 
apparatus is described according to the method of the invention. The following 
process calculates the non-reduced result CSA = (A*B + K*N + CSA)/2^s. 

Input: A, B, N, s (A -> R3, B -> R0, B + N -> R1, N -> R2) 
Output: NRMM^(s)(A,B) 

For I from 0 to s-1 do 
    K_I = LSB(CSA + R3_I * R0_0) 
    CSA = (CSA + X)/2, where X = 0   if R3_I = 0 and K_I = 0 
                                 R0  if R3_I = 1 and K_I = 0 
                                 R2  if R3_I = 0 and K_I = 1 
                                 R1  if R3_I = 1 and K_I = 1 
End for 
Return CSA 

The control inputs of MX1 and MX4 are set to select the inputs of K_I and R3, 
respectively. The control inputs of MX2 and MX3 are set to select the inputs of 
K_I and "0", respectively, when a simple NRMM^(s) is performed, or 
alternatively, the inputs of K_I and R4, respectively, as part of an interleaved 
multiplication (illustrated in Fig. 5). As previously mentioned, the value of K 
is obtained in the R4 register as the s cycles of the calculation are completed. 
Of course the NRMM^(s) process may be also utilized to compute (A*B + K*N)/2^s, 
by zeroing the content of the CSA in the initialization steps. 
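
The register transfers of the NRMM^(s) process can be followed in software. 
The Python sketch below is a behavioural model only: the function name 
nrmm_registers, the tuple used for the four-way selection and the integer 
k_word that collects the K bits (the role played by R4) are assumptions made 
for this illustration. 

```python
def nrmm_registers(a, b, n_mod, s, csa=0):
    """Returns (CSA, K) with CSA = (a*b + K*n_mod + initial csa) / 2**s."""
    r0, r1, r2, r3 = b, b + n_mod, n_mod, a       # B, B+N, N, A
    k_word = 0                                    # bits K_I, as collected in R4
    for i in range(s):
        a_i = (r3 >> i) & 1
        k_i = (csa + a_i * (r0 & 1)) & 1          # K_I = LSB(CSA + R3_I * R0_0)
        mux_out = (0, r0, r2, r1)[a_i + 2 * k_i]  # select 0, R0, R2 or R1
        csa = (csa + mux_out) >> 1                # add, then one right shift
        k_word |= k_i << i
    return csa, k_word


csa, k = nrmm_registers(319, 319, 249, 10)
print(csa, k)
```

Because B + N is held ready in R1, each iteration performs a single addition 
instead of two, which is the point of the four-way selection. 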

Montgomery multiplication by 1 (MMULBY1^(s)) 

The following process is utilized for computing CSA = (B + K*N + CSA)/2^s, for 
some value B, utilizing the PKI apparatus, according to the method of the 
invention. As previously explained, for B<2*N and s=n+2, the result obtained 
by the MMULBY1^(s)(B) operation is reduced (for B<2*N and s=n+2, 
MMULBY1^(s)(B)<N). 

Input: B, N, s (B -> R0, B + N -> R1, N -> R2, 1 -> R3) 
Output: MMULBY1^(s)(B) = NRMM^(s)(B,1) 

K_0 = LSB(CSA + R0_0) 
CSA = (CSA + X)/2, where X = R0  if K_0 = 0 
                             R1  if K_0 = 1 
For I from 1 to s-1 do 
    K_I = CSA_0 
    CSA = (CSA + X)/2, where X = 0   if K_I = 0 
                                 R2  if K_I = 1 
End for 
Return CSA 

The control inputs of MX1, MX3, and MX4 are set to select the inputs of K_I, "0", 
and R3 respectively (the selection of MX2 does not affect this operation). The 
value of K is obtained in the R4 register, and the final result is obtained in 
the CSA, as the s cycles of the calculation are finished. It should be noted 
that instead of loading R3 with the value of 1 (n+2 bits), an external control 
may be utilized for forcing "1" at the MX4 output at the first cycle, and "0" at 
the remaining cycles (illustrated by dashed lines in Fig. 4). As before, the 
computation of (B + K*N)/2^s can be obtained by zeroing the content of the CSA 
in the initialization steps. 
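
A behavioural sketch of the multiplication by 1 is given below in Python. The 
function name mmulby1 is an assumption of this illustration, and the check 
skips inputs that are multiples of N, for which this model does not return a 
value below N. 

```python
def mmulby1(b, n_mod, s, csa=0):
    """Model of MMULBY1: returns (b + K*n_mod + initial csa) / 2**s."""
    r0, r1, r2 = b, b + n_mod, n_mod
    k_0 = (csa + (r0 & 1)) & 1                # first cycle: the single 1 bit of R3
    csa = (csa + (r1 if k_0 else r0)) >> 1
    for _ in range(1, s):                     # remaining cycles: R3 bits are 0
        k_i = csa & 1
        csa = (csa + (r2 if k_i else 0)) >> 1
    return csa


N, s = 249, 10
reduced = all(
    mmulby1(b, N, s) < N and (mmulby1(b, N, s) * 2**s - b) % N == 0
    for b in range(2 * N) if b % N != 0
)
print(mmulby1(25, N, s), reduced)
```

With the last value of Table 3 as input, mmulby1(25, 249, 10) gives 241, the 
final result of Example 3. 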

Regular multiplication (RMUL) 

There are various ways of implementing regular multiplication utilizing the PKI 
apparatus, according to the method of the invention. The following process is 
one possible way for computing CSA:R4 = A*B + C*D + CSA (the content of the 
CSA holds the results of the previously performed operation, or alternatively it 
may be set to a desired value). The most significant bits of the RMUL result 
are obtained in the CSA, and the least significant bits in R4. 

Input: A, B, C, D, n (B -> R0, B + D -> R1, D -> R2, A -> R3, C -> R4) 
Output: RMUL(A,B,C,D) = A*B + C*D + CSA 

For I from 0 to n-1 do 
    CSA = CSA + X, where X = 0   if R3_I = 0 and R4_I = 0 
                             R0  if R3_I = 1 and R4_I = 0 
                             R2  if R3_I = 0 and R4_I = 1 
                             R1  if R3_I = 1 and R4_I = 1 
    R4 = R4/2 + CSA_0 * 2^(n-1) 
    CSA = CSA/2 
End for 
Return CSA & R4 



The control inputs of MX1, MX2, MX3, and MX4 are set to select the inputs of 
R4, CSA_0, "0", and R3, respectively. After performing n iterations, the n LSBs 
of the result are obtained in the register R4, and the n MSBs of the result are 
obtained in the CSA. 
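
The RMUL process can be modelled as follows. This Python sketch is an 
illustration under stated assumptions: the function name rmul is introduced 
here, A and C are taken to fit in n bits, and the bit shifted out of the CSA 
is written into the top position of R4 while the multiplier bit of C is 
consumed from its bottom position. 

```python
def rmul(a, b, c, d, n, csa=0):
    """Returns (CSA, R4) with CSA * 2**n + R4 == a*b + c*d + initial csa."""
    r0, r1, r2, r3, r4 = b, b + d, d, a, c
    for _ in range(n):
        a_i, c_i = r3 & 1, r4 & 1
        csa += (0, r0, r2, r1)[a_i + 2 * c_i]      # add A_I*B + C_I*D
        r4 = (r4 >> 1) | ((csa & 1) << (n - 1))    # CSA_0 enters the MSB of R4
        csa >>= 1                                  # CSA = CSA / 2
        r3 = (r3 >> 1) | (a_i << (n - 1))          # cyclic rotation keeps A in R3
    return csa, r4


high, low = rmul(200, 319, 77, 249, 8, csa=5)
print(high, low, high * 2**8 + low)
```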

Montgomery exponent 

The PKI application of an exponent calculation is based on the exponent process 
that was described hereinabove, for computing A^E mod N (A<N, with no loss of 
generality). For carrying out this calculation with the PKI apparatus, the pre- 
calculated value A' = A*2^s mod N is required. For this particular process, an 
adjusted (truncated) value E' for the exponent E = (e_(m-1), e_(m-2), ..., e_0)_2 
is required, wherein the MSB e_(m-1) is eliminated, and the bit order is 
reversed, thus obtaining E' = (e_0, e_1, ..., e_(m-2))_2 (m is the number of bits 
in E). 

Process 2: 

Input: m, A', N, E' (A' -> R0, A' + N -> R1, N -> R2, A' -> R3, E' -> R4) 
Output: CSA = A^E mod N (left-to-right approach) 

For I from 0 to m-2 do 
    0 -> CSA 
    4.1. R0 = NRSQR^(s)(R0) 
    4.2. R1 = R0 + R2 
    4.3. If R4_I = 1 then 0 -> CSA; R0 = NRMM^(s)(R0,R3); R1 = R0 + R2 
End for 
0 -> CSA 
MMULBY1^(s)(R0) 
Return CSA 

A sequence of Montgomery squarings and multiplications is performed in the 
loop, in the above process. The operation of the PKI apparatus utilizing process 
2 is further illustrated in Fig. 7A, in the form of a flowchart. The operation 
is initiated in steps 730 and 731, in which the values A', E', N, and m-1 are 
input to the PKI apparatus. A sequence of operations (steps 4.1. to 4.3. 
hereinabove) is performed in a loop starting in steps 732a and 732b, where a 
right shift is performed on the content of register R4, the CSA content is 
zeroed, and an NRSQR^(s) of the content of R0 is performed. In step 732c the 
NRSQR^(s) result, which is obtained in the CSA, is loaded into register R0, and 
the addition result of the content of the CSA and the register R2 is loaded into 
register R1. 

The operation of step 4.3. of the exponent process hereinabove is carried out in 
step 732d, where the LSB of R4 is examined, and if it equals "1" the CSA 
content is zeroed and an NRMM^(s) of the content of registers R0 and R3 is 
performed, the result of which is then stored in R0 and also added to the 
content of R2 and stored in the register R1. The operation proceeds in step 
732e, in which the value of the loop index i is decremented by 1, and in step 
732f it is checked whether the loop index i equals zero. If i is not zero, 
another iteration of the process is performed, as the operation proceeds to 
step 732a; otherwise, the CSA content is zeroed and an MMULBY1^(s) operation is 
performed on the content of R0. The exponentiation (reduced) result is obtained 
in the CSA after performing the MMULBY1^(s) operation to eliminate the 2^s 
element. 
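
Process 2 can be exercised on the values of Example 3 with a short Python 
sketch. The function names nrmm and mont_exp_left_to_right are assumptions of 
this illustration, and instead of building the reversed exponent E' the loop 
simply reads the bits e_(m-2) down to e_0 of E, which is the same scanning 
order. 

```python
def nrmm(a, b, n_mod, s):
    """Non-reduced Montgomery product (a*b + K*n_mod) / 2**s."""
    acc = 0
    for i in range(s):
        acc += ((a >> i) & 1) * b
        acc += (acc & 1) * n_mod
        acc >>= 1
    return acc


def mont_exp_left_to_right(a, e, n_mod):
    """A**E mod N using squarings and multiplications without reduction."""
    s = n_mod.bit_length() + 2
    a_prime = (a << s) % n_mod                  # precalculated A' = A * 2**s mod N
    m = e.bit_length()
    r0 = a_prime
    for bit in range(m - 2, -1, -1):            # e_(m-2) down to e_0
        r0 = nrmm(r0, r0, n_mod, s)             # step 4.1
        if (e >> bit) & 1:
            r0 = nrmm(r0, a_prime, n_mod, s)    # step 4.3
    return nrmm(r0, 1, n_mod, s)                # final multiplication by 1


result = mont_exp_left_to_right(212, 240, 249)
print(result)
```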

It should be understood that the process illustrated in Fig. 7A is carried out 
utilizing an external control (not shown). This control may be performed by 
software utilizing a processor/controller, or by the addition of dedicated 
hardware. 

Other exponentiation processes, such as right-to-left binary exponentiation, 
m-ary exponentiation, and sliding windows exponentiation, can also be 
implemented analogously ("Handbook of Applied Cryptography" by Alfred J. 
Menezes, Paul C. van Oorschot and Scott A. Vanstone, CRC Press, October 
1996). 

An example of one additional exponentiation method utilizing the PKI 
apparatus is disclosed in the following process. In this process (right-to-left 
binary exponentiation), the exponent value is utilized directly, and the 
adjustment of its bits is not required. 

Process 3: 

Input: m (>1), A', N, E (A' -> R0, A' + N -> R1, N -> R2, A' -> R3, E -> R4) 
Output: CSA = A^E mod N 

Flag = 1 
For I from 0 to m-2 do 
    5.1 If (Flag = 1) and (R4_I = 1) then R3 = R0; Flag = 0 
    5.2 Else if (R4_I = 1) then 0 -> CSA; R3 = NRMM^(s)(R0,R3) 
    0 -> CSA 
    5.3 R0 = NRSQR^(s)(R0) 
    5.4 R1 = R0 + R2 
End for 
R0 = NRMM^(s)(R0,R3) 
R1 = R0 + R2 
MMULBY1^(s)(R0) 
Return CSA 
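
A Python sketch of the right-to-left process is given below. The function 
names are assumptions of this illustration, and the value None takes the 
place of Flag = 1. The guard before the last multiplication is an addition of 
this sketch for exponents of the form 2^(m-1), in which no bit below the MSB 
is set; it is not taken from the listing above. 

```python
def nrmm(a, b, n_mod, s):
    """Non-reduced Montgomery product (a*b + K*n_mod) / 2**s."""
    acc = 0
    for i in range(s):
        acc += ((a >> i) & 1) * b
        acc += (acc & 1) * n_mod
        acc >>= 1
    return acc


def mont_exp_right_to_left(a, e, n_mod):
    """A**E mod N, scanning the exponent from its least significant bit."""
    s = n_mod.bit_length() + 2
    r0 = (a << s) % n_mod            # A' = A * 2**s mod N, squared repeatedly
    r3 = None                        # running product; None stands for Flag = 1
    m = e.bit_length()
    for i in range(m - 1):           # bits e_0 .. e_(m-2)
        if (e >> i) & 1:
            r3 = r0 if r3 is None else nrmm(r0, r3, n_mod, s)   # steps 5.1, 5.2
        r0 = nrmm(r0, r0, n_mod, s)                             # step 5.3
    if r3 is not None:               # guard added in this sketch
        r0 = nrmm(r0, r3, n_mod, s)
    return nrmm(r0, 1, n_mod, s)     # final multiplication by 1


result = mont_exp_right_to_left(212, 240, 249)
print(result)
```

The squarings of R0 and the products accumulated in R3 do not depend on each 
other within an iteration, which is what allows two units to work in parallel. 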

The PKI operations in this process are illustrated in Fig. 7B. This process is 
initiated in steps 750 and 751, in which the values A', E, N, and m-1 are input 
to the PKI apparatus, and a Flag is set to "1". The operations performed in 
steps 5.1. to 5.4. in the exponent process hereinabove begin in step 752a, in 
which a right shift is performed on the content of register R4. In step 752b 
the LSB of R4 is examined, and if it equals "1" another test is performed in 
step 752c, to determine if the Flag is in the state of "1". If the Flag state 
is "1", register R3 is loaded with the content of register R0, and the Flag 
state is reset to "0". 
Otherwise, if the Flag state is "0" in step 752c, the CSA content is zeroed and 
an NRMM^(s) operation is performed on the content of registers R0 and R3, the 
result of which is obtained in the CSA, and which is then loaded into the R3 
register. The operation continues by passing the control to step 752d. 

If the state of the LSB of the R4 register is not "1" in step 752b, the 
operation proceeds to step 752d, where the CSA content is zeroed and an 
NRSQR^(s) operation on the content of R0 is carried out, the result of which is 
obtained in the CSA. The NRSQR^(s) result is then loaded into register R0, and 
it is also added to the content of register R2. The addition result of the 
contents of the CSA and register R2 is stored in register R1. The process 
proceeds in step 752f, in which the loop index i is decremented by 1. In step 
752e, i is examined to determine if it equals zero. If i is not zero, another 
iteration is performed as the control is passed to step 752a. Otherwise, the 
CSA content is zeroed and an NRMM^(s) operation on the R0 and R3 contents is 
performed, the result of which is obtained in the CSA, and loaded into register 
R0. The addition of the contents of register R2 and the CSA is stored in 
register R1, the CSA content is zeroed and an MMULBY1^(s) is performed. The 
final result (reduced) is then obtained in the CSA. 

As explained before, an external control is utilized to carry out the steps of this 
operation. 

Allowing flexibility in choosing different implementations of exponentiation 
processes is of importance in applications. For example, a right-to-left 
exponentiation process enables utilizing two PKI apparatus in parallel. 

It should be also appreciated that the method of the invention substantially 
improves the security of the PKI apparatus, particularly against attacks which 
are based on the detection of subtraction operations, as performed in the 
conventional Montgomery Multiplication methods. In such attack methods the 
user's secret (private) key is computed by revealing the reduction operations 
performed (W. Schindler, "A Timing Attack against RSA with the Chinese 
Remainder Theorem", Second International Workshop, Worcester, MA, USA, 
August 2000). A common method, which is currently used against such attacks, 
is to perform additional (dummy) subtraction operations, which of course 
consumes more time and power. Since in the method of the invention 
subtractions are not performed, it is not possible to reveal the secret key 
utilizing such methods. 

As was mentioned hereinabove, the method of the invention can be utilized to 
implement a right-to-left exponentiation process with two PKI apparatus 
operating in parallel. As will be appreciated by those having skill in the art, 
such a parallel implementation further improves the security of the system. 
Since it is difficult to follow and identify when and which operations are 
performed by such a parallel system, the opponent's task becomes even more 
problematical. 

Montgomery interleaved multiplication 

In Fig. 5 the values loaded into each register (R0, R1, R2, R3, and R4), and the 
input selection of each of the multiplexers (MX1, MX2, MX3, and MX4), are 
described for the different steps (I, II, III, and IV) of the Montgomery 
interleaved multiplication. At each step, the registers are loaded with the 
respective values, the control inputs of the multiplexers are set to provide 
the corresponding inputs, and a process of s iterations is performed for 
calculating the respective product. 

In the following discussion, the Montgomery interleaved modular multiplication 
of A · B mod N, wherein A, B, and N are 2n-bit values, is described. Each of the 
integer values, A, B, and N, is treated as a pair of n-bit partial values. The 
partial values of A = A^1 * 2^n + A^0, for example, are denoted as follows: 
A = (A^1, A^0), wherein A^1 denotes the n MSBs of A, and A^0 denotes the n LSBs 
of A. Similarly, the partial values of B = B^1 * 2^n + B^0 and 
N = N^1 * 2^n + N^0 are denoted by B = (B^1, B^0) and N = (N^1, N^0). This 
embodiment may be further modified (with software) to allow computation of 
A · B mod N, for A, B, and N of any length. In other forms, each integer may 
consist of l partial values, each of which is n bits long. 

In step I, the computation of (A^0 * B^0 + N^0 * K^0)/2^n is performed by 
loading registers R0, R1, R2, and R3 with B^0, B^0 + N^0, N^0, and A^0, 
respectively. In addition, the control inputs of MX1, MX2, MX3, and MX4 are set 
to select the inputs of K_I, K_I, "0", and R3, respectively. The result 
(A^0 * B^0 + N^0 * K^0)/2^n = A^0 * B^0 * 2^(-n) mod N^0 remains in the CSA. 
Since in this step MX2 selects the K_I output, register R4 is loaded with the 
bits of the K^0 value, which are required for the computation of the next step. 

In step II, regular multiplication is performed, to calculate 
(A^0 * B^1 + N^1 * K^0 + CSA_(I)), wherein CSA_(I) is the result that was 
obtained in the previous step, step I. The values B^1, B^1 + N^1, N^1, and A^0 
are loaded into the R0, R1, R2, and R3 registers, respectively, and the control 
inputs of MX1, MX2, MX3, and MX4 are set to select the inputs of R4, CSA_0, "0", 
and R3, respectively. It should be noted that the right shift of the bits of R3 
is a cyclic bit rotation, so that there is actually no need to reload R3 with 
the value of A^0. Since in this step the apparatus is utilized for the 
calculation of regular multiplication, the n LSBs of the result are fed into 
the serial-in of the R4 register, and the n MSBs of the result remain in the 
CSA. 

In the next step, step III, the calculation of 
(A^1 * B^0 + N^0 * K^1 + R4 * 2^n + CSA)/2^n mod N^0 is carried out. For this 
purpose, prior to any operation in this step, the value stored in the R4 
register is stored in the CSA, and the content of the CSA is stored in the R4 
register. In addition, registers R0, R1, R2, and R3 are loaded with the values 
B^0, N^0 + B^0, N^0, and A^1, respectively, and the control inputs of MX1, MX2, 
MX3, and MX4 are set to select the inputs of K_I, K_I, R4, and R3, respectively. 
During the operation of this step, the R4 register is loaded with the bits, 
K^1_I, of K^1. The result of this step remains in the CSA for the calculation 
of the final step. 

In the last step, IV, the regular multiplication of 
A^1 * B^1 + N^1 * K^1 + CSA_(III) is performed, wherein CSA_(III) is the result 
that was obtained in step III. The registers R0, R1, R2, and R3 are loaded with 
the values B^1, B^1 + N^1, N^1, and A^1, respectively, and the control inputs of 
MX1, MX2, MX3, and MX4 are set to select the inputs of R4, CSA_0, "0", and R3, 
respectively. During this step the n LSBs of the result are loaded into the R4 
register, and the n MSBs (which may also be of n+1 bits) of the result are 
obtained in the CSA. 

The final result of each of the steps in this process (steps I to IV) may be 
greater than N, and thus reduction may be required. If it is required, 
reduction is performed by software after each step. Alternatively, one may 
implement the same method of interleaved multiplication by utilizing an 
extended non-reduced approach without needing to reduce the obtained result 
after each step. In addition, the computation of greater values may be carried 
out utilizing software for storing temporary results of the interleaved 
multiplication. 
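
The four steps can be followed at word level in a short Python sketch. It is 
an illustration under stated assumptions: the function names mont_step and 
interleaved_mont_mul are introduced here, n iterations are used in steps I and 
III, and the exchange between R4 and the CSA before step III is replaced by 
passing the whole intermediate value as the initial accumulator. 

```python
def mont_step(a_part, b_part, n_part, n, acc):
    """n Montgomery iterations: ((a*b + K*n_part + acc) / 2**n, K)."""
    k_word = 0
    for i in range(n):
        acc += ((a_part >> i) & 1) * b_part
        k_i = acc & 1
        acc = (acc + k_i * n_part) >> 1
        k_word |= k_i << i
    return acc, k_word


def interleaved_mont_mul(a, b, n_mod, n):
    """Returns (T, K) with T * 2**(2n) == a*b + K*n_mod for 2n-bit operands."""
    mask = (1 << n) - 1
    a1, a0 = a >> n, a & mask
    b1, b0 = b >> n, b & mask
    n1, n0 = n_mod >> n, n_mod & mask
    t, k0 = mont_step(a0, b0, n0, n, 0)    # step I:   (A0*B0 + N0*K0) / 2**n
    t = a0 * b1 + n1 * k0 + t              # step II:  regular multiplication
    t, k1 = mont_step(a1, b0, n0, n, t)    # step III: (A1*B0 + N0*K1 + t) / 2**n
    t = a1 * b1 + n1 * k1 + t              # step IV:  regular multiplication
    return t, (k1 << n) | k0


a, b, n_mod, n = 0xBEEF, 0x1234, 0xF123, 8     # arbitrary 16-bit test values
t, k = interleaved_mont_mul(a, b, n_mod, n)
print(t, (t << (2 * n)) % n_mod == (a * b) % n_mod)
```

The value returned here is not reduced; as noted above, a reduction may still 
be needed afterwards. 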

The above examples and description have of course been provided only for the 
purpose of illustration, and are not intended to limit the invention in any way. 
As will be appreciated by the skilled person, the invention can be carried out in 
a great variety of ways, employing different techniques from those described 
above, all without exceeding the scope of the invention. 
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CLAIMS 

1. A method for carrying out modular arithmetic computations involving
multiplication operations by utilizing a non-reduced and extended Montgomery
multiplication between a first A and a second B integer values, in which the
number of iterations required is greater than the number of bits n of an odd
modulo value N, comprising:

a) providing an accumulating device (S) capable of storing n+2-bit
values, of adding n+2-bit values (X) to its content (S + X → S), and of
dividing its content by 2 (S/2 → S);

b) whenever desired, setting the content of said device to a zero value
("0" → S) and performing in said device at least s (> n+1) iterations, while in
each iteration choosing one bit, in sequence, from the value of said first
integer value A (A_i; 0 ≤ i ≤ s-1), starting from its least significant bit (A_0):

b.1) adding to the content of said device S the product of the selected bit
A_i and said second integer value B (S + A_i*B → S);

b.2) adding to the resulting content of said device the product of its
current least significant bit S_0 and N (S + S_0*N → S);

b.3) dividing the resulting content of said device by 2 (S/2 → S); and

b.4) obtaining a non-reduced and extended Montgomery multiplication
result by repeating steps b.1) to b.3) s-1 additional times while in each
time using the previous result (S).
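
The iteration of steps b.1) to b.3) can be modelled in software. The following is a minimal Python sketch, not the hardware itself: the function name `ext_montgomery` and the sample values are illustrative assumptions, and the accumulator is modelled as an unbounded Python integer rather than an n+2-bit device.

```python
def ext_montgomery(a: int, b: int, n_mod: int, s: int) -> int:
    """Bit-serial, non-reduced Montgomery product.

    Returns S such that S * 2**s is congruent to a * b modulo n_mod.
    """
    if n_mod % 2 == 0:
        raise ValueError("the modulus N must be odd")
    acc = 0                          # b) accumulator S set to zero
    for i in range(s):               # s iterations, one bit of A per iteration
        a_i = (a >> i) & 1           # bits of A, least significant bit first
        acc += a_i * b               # b.1) S + A_i*B -> S
        acc += (acc & 1) * n_mod     # b.2) S + S_0*N -> S
        acc >>= 1                    # b.3) S/2 -> S (exact, S is even here)
    return acc


# Illustrative values (assumptions): an 8-bit odd modulus and operands below 2*N.
N = 239
n = N.bit_length()
s = n + 2
A, B = 300, 451
S = ext_montgomery(A, B, N, s)
```

With s = n+2 and both operands below 2N, the returned value also stays below 2N, so it can be fed back as an operand without an intermediate reduction.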

2. A method according to claim 1, wherein the Montgomery multiplication result
is obtained by unifying steps b.1) to b.3) into a single step, by:

a) providing a first storing device (R2) for storing the modulo value N;

b) providing a second storing device (R0) for storing the value of the
second integer B;

c) providing a third storing device (R1) for storing the sum of the
modulo N and said second integer value B;

d) providing an arbitration circuitry having a first (In1), second (In2)
and third (In3) inputs from said first (R2), second (R0) and third (R1)
storage devices respectively, and having an additional zero input (In0), said
arbitration device receives a first (C1) and a second (C0) control inputs, and
is capable of selecting one of its other inputs as its output, according to the
following steps:

d.1) whenever its first (C1) and second (C0) control inputs are zero,
selecting said additional zero input (In0);

d.2) whenever its first control input (C1) is one and its second control
input (C0) is zero, selecting its second input (In2);

d.3) whenever its first control input (C1) is zero and its second control
input (C0) is one, selecting its first input (In1);

d.4) whenever its first (C1) and second (C0) control inputs are one,
selecting said third input (In3);

wherein the selected input is provided as the output of said arbitration circuitry
which is attached to the input of the accumulating device;

e) applying the bits of the first integer value A (A_i; 0 ≤ i < s), one by
one, in sequence, starting from its least significant bit (A_0), to said first
control input (C1); and

f) providing circuitry for producing the state (K_i) of said second
control input (C0) according to the state of the selected bit of said first
integer value (A_i), the state of the least significant bit of said second integer
value (B_0), and according to the state of the least significant bit of said
accumulating device (S_0).
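
The unified step can be sketched in Python as a four-way selection followed by one addition and one halving. The names `select_input` and `unified_montgomery` are hypothetical, and the expression used for K_i (the least significant bit of S + A_i*B) is an assumption chosen so that the single step reproduces steps b.1) to b.3) of claim 1.

```python
def select_input(c1: int, c0: int, r0: int, r1: int, r2: int) -> int:
    """Arbitration circuitry of steps d.1) to d.4)."""
    if c1 == 0 and c0 == 0:
        return 0          # d.1) zero input In0
    if c1 == 1 and c0 == 0:
        return r0         # d.2) In2, fed from R0 (B)
    if c1 == 0 and c0 == 1:
        return r2         # d.3) In1, fed from R2 (N)
    return r1             # d.4) In3, fed from R1 (N + B)


def unified_montgomery(a: int, b: int, n_mod: int, s: int) -> int:
    """One addition and one halving per iteration, as in claim 2."""
    r2, r0, r1 = n_mod, b, n_mod + b              # a), b), c) register contents
    acc = 0
    for i in range(s):
        c1 = (a >> i) & 1                         # e) bit A_i drives C1
        c0 = (acc & 1) ^ (c1 & (b & 1))           # f) K_i from A_i, B_0 and S_0 (assumed form)
        acc = (acc + select_input(c1, c0, r0, r1, r2)) >> 1
    return acc
```

Storing N + B in advance is what allows a single adder pass per iteration: the selector chooses among 0, B, N and N + B instead of performing two successive additions.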



3. A method according to claim 2, wherein the state (K_i) of the second control
input (C0) is produced by performing the following steps:

a) producing a value of one (K_i = "1") whenever:

a.1) the state of the first control input (C1) and the state of the least
significant bit of the second integer value (B_0) are one, and the state of
the least significant bit of the accumulating device (S_0) is zero; or

a.2) the state of said first control input (C1) and the state of the least
significant bit (B_0) of said second integer value B are in different states,
and the state of the least significant bit (S_0) of said accumulating device
is one; and

b) otherwise, producing a zero value (K_i = "0").

4. A method according to claim 3, wherein the circuitry utilized for producing
the state of the second control input (C0) comprises a logical AND gate and a
logical XOR gate, where the inputs of said logical AND gate receive the
state of the first control input (C1) and the state of the least significant bit (B_0)
of the second integer value B, where the inputs of said logical XOR gate
receive the output from said logical AND gate and the state of the least
significant bit of said accumulating device (S_0), and where the output of said
logical XOR gate is utilized as the state of the second control input (C0).
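
The AND/XOR wiring can be checked exhaustively in a few lines of Python. The sketch below (the name `k_gate` is hypothetical) tabulates the gate output for all eight input combinations and compares it with the parity of S + A_i*B, which is the quantity that decides whether N must be added before halving.

```python
from itertools import product


def k_gate(c1: int, b0: int, s0: int) -> int:
    """AND of C1 and B_0, followed by XOR with S_0."""
    return (c1 & b0) ^ s0


# Full truth table keyed by (C1, B_0, S_0).
truth_table = {
    (c1, b0, s0): k_gate(c1, b0, s0)
    for c1, b0, s0 in product((0, 1), repeat=3)
}

# The gate output equals the least significant bit of S + A_i*B for every input.
parity_ok = all(
    k == ((s0 + c1 * b0) & 1) for (c1, b0, s0), k in truth_table.items()
)
```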

5. A method according to claims 1 or 2, wherein the number of iterations s
utilized for carrying out the Montgomery multiplication is n+2, thereby
obtaining an extended Montgomery multiplication result in which n+2 iterations
are performed.

6. A method according to claim 2, further comprising allowing modular arithmetic
operations to be carried out, by performing the following steps:

a) utilizing for the first (R2), second (R0), and third (R1) storage
devices n+2-bit shift registers having a serial input into their most
significant bit locations, and which may be capable of outputting their
content in parallel;

b) providing said first storage device (R2) with a serial output, from
its least significant bit location (R2_0), and allowing it to perform cyclic bit
rotation;

c) allowing said second storage device (R0) to receive on its serial
input the least significant bit (S_0) of the accumulating device;

d) providing a fourth storage device (R3) capable of serially outputting
its content, bit by bit in sequence (R3_i, i = 0, 1, 2, ..., n+1), starting from its
least significant bit (R3_0), said fourth storage device is capable of storing
n+2 bits, and of performing cyclic bit rotation of its content;

e) providing a fifth storage device (R4) having a serial input and a
serial output, and which is capable of storing values of n+2 bits;

f) providing a sixth storage device (R5) capable of serially outputting
its content, bit by bit in sequence (R5_i, i = 0, 1, 2, ..., n+1), starting from its
least significant bit, said fourth storage device is capable of storing n+2
bits;

g) providing a first arbitration device (MX1) having a first input from
said fifth storage device (R4_i), and a second input from the circuitry
producing the state of the second control input (K_i), the output of said first
arbitration device is attached to the second control input (C0);

h) providing a second arbitration device (MX2) having a first input
being equal to the least significant bit of the accumulating device (S_0), a
second input received from the output of said circuitry (K_i), and a third
input connected to the serial output (R4_i) of said fifth storage device (R4),
the output of said second arbitration device is attached to the serial input
of said fifth storage device (R4);

i) providing a third arbitration device (MX3) having a first input
which is constantly fed with a zero value ("0"), and a second input received
from the serial output of said fifth storage device (R4_i), the output of said
third arbitration device is connected to a serial input of said accumulating
device;

j) providing a fourth arbitration device (MX4) having a first input
connected to the serial output of said sixth storage device (R5_i), and a
second input connected to the serial output of said fourth storage device
(R3_i), the output of said fourth arbitration device is connected to the first
control input (C1); and

k) providing an adder capable of performing serial addition of n+2-bit
values, said adder receives a first input from the least significant bit
location of the accumulating device (S_0), and a second input from the serial
output of said first storage device (R2), the output of said adder is
connected to the serial input of said third storage device (R1).

7. A method according to claim 6, wherein the accumulating device consists of
n+2 addition and latching stages, each of which consists of a first and a second
flip flop devices and a full adder device having three inputs, except for the first
stage wherein said second flip flop is excluded, the method comprising:

a) connecting the first input of said full adder to the output of a first
flip flop device;

b) connecting the second input of said full adder to the output of a
second flip flop device of the subsequent addition and latching stage; and

c) connecting the third input of said full adder to the respective bit
output of the arbitration device (MUX_i, 0 ≤ i ≤ n+1).

8. A method according to claim 7, further comprising adding the output from the
third arbitration device (MX3), via the serial input of said accumulating device,
to the addition result of the (n+1)-th addition and latching stage by performing
the following steps:

a) providing the (n+1)-th addition and latching stage with a first and
second half adder devices, and a third flip flop device;

b) connecting the input of the first flip flop device to the sum output of
said second half adder;

c) connecting the input of the second flip flop device to the carry
output of said second half adder, and connecting the output of said flip flop
device to the second input of the full adder of the (n+2)-th addition and
latching stage;

d) connecting the first input of said second half adder to the carry
output of the full adder of the (n+1)-th addition and latching stage, and its
second input to the carry output of said first half adder;

e) connecting the first input of said first half adder to the sum output
of said full adder, and connecting the second input of said second half adder
to the output of the third arbitration device (MX3); and

f) connecting the input of said third flip flop device to the sum output
of said first half adder, and connecting its output to the second input of the
full adder of the (n-1)-th addition and latching stage.

9. A method according to claims 3 and 8, wherein the state of the second control
input (C0) is determined utilizing the least significant bit of the second storage
device (R0), the output of the fourth arbitration device (MX4), the carry output
of the full adder of the first addition and latching stage, and the sum output of
the full adder of the second addition and latching stage, the method comprising:

a) connecting the least significant bit of said second storage device
(R0) and the output of said fourth arbitration device (MX4) to the inputs of
an AND logical gate;

b) providing an additional half adder and an additional flip flop
device;

c) connecting the first input of said half adder to the sum output of
the full adder of the second addition and latching stage, and its second
input to the carry output of the full adder of the first addition and latching
stage;

d) connecting the sum output of said half adder to the input of said
additional flip flop device; and

e) connecting the output of said AND logical gate and the output
of said flip flop device to the inputs of a XOR gate, and utilizing the output
of said XOR gate to determine the state of said second control input (C0).

10. A method according to claim 9, further comprising carrying out non-reduced
Montgomery squaring of an integer value B, by performing the following steps:

a) loading the first (R2), second (R0), and third (R1) storage devices
with the values of the modulus N, said integer B, and the sum of said
modulus and said integer (N+B), respectively;

b) setting the first (MX1), second (MX2), third (MX3) and fourth
(MX4) arbitration devices to select the inputs of the circuitry for producing
the state (K_i) of the second control input (C0), the circuitry for producing
the state (K_i) of the second control input (C0), the zero value ("0"), and the
output of the sixth storage device (R5), respectively;

c) loading the content of the sixth storage device (R5) with the content
of the second storage device (R0), and loading the content of the
accumulating device with a zero value;

d) performing the non-reduced and extended Montgomery
multiplication wherein the content of said sixth storage device (R5) is
shifted by one bit to the right in each cycle; and

e) obtaining the non-reduced Montgomery squaring result in the
accumulating device.

11. A method according to claim 9, further comprising carrying out Montgomery
multiplication of a first (A) and second (B) integer values, by performing the
following steps:

a) loading the first (R2), second (R0), third (R1), and fourth (R3)
storage devices with the values of the modulus N, said second integer (B),
the sum of said modulus and said second integer (N+B), and said first
integer (A), respectively;

b) setting the first (MX1), second (MX2), third (MX3) and fourth
(MX4) arbitration devices to select the inputs of the circuitry for producing
the state (K_i) of the second control input (C0), the circuitry for producing
the state (K_i) of the second control input (C0), the zero value ("0"), and the
output of the fourth storage device (R3), respectively;

c) loading the content of the accumulating device with a zero value;

d) performing the non-reduced and extended Montgomery
multiplication wherein the content of said fourth storage device (R3) is
shifted by one bit to the right in each cycle; and

e) obtaining the non-reduced Montgomery multiplication result in the
accumulating device.

12. A method according to claim 9, further comprising carrying out modular
exponentiation A^E mod N, comprising:

a) pre-calculating the adjusted operand value A' = A * 2^s mod N;

b) composing an adjusted value for the exponent E = (e_{m-1}, e_{m-2}, ..., e_0)_2
by reversing its bit order, and eliminating the most significant bit e_{m-1}, to
obtain the adjusted value E' = (e_0, e_1, ..., e_{m-2})_2;

c) loading the content of the first, second, third, and fifth storage
devices with the values of the modulus N, said adjusted operand (A'), the
sum of said modulus and said adjusted operand (N + A'), and the adjusted
exponent value E', respectively, obtaining the bit length m of said
exponent value E and performing the following steps:

c.1) right shifting the content of said fifth storage device (R4);

c.2) performing non-reduced Montgomery squaring to obtain the
non-reduced Montgomery square of the content of said third storage device
(R3) in the accumulating device;

c.3) loading the content of said third storage device (R3) with the
content of said accumulating device;

c.4) loading the content of said third storage device (R1) with the sum
of the content of said first storage device (R2) and the content of said
accumulating device;

c.5) if the least significant bit (R4_0) of said fifth storage device equals
"1" performing non-reduced and extended Montgomery multiplication to
obtain the non-reduced Montgomery multiplication result of the contents
of said second storage device (R0) and said fourth storage device (R3), in
said accumulating device, loading the content of said second storage
device (R0) with the content of said accumulating device, and loading the
content of said third (R1) storage device with the sum of the contents of
said first storage device (R2) and said accumulating device; and

c.6) repeating steps c.1) to c.5) additional m-2 times; and

d) performing non-reduced and extended Montgomery multiplication
of the content of said second storage device (R0) by 1 to obtain the final
reduced result in said accumulating device.
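
The overall flow of this exponentiation (adjust the operand, drop the leading exponent bit, square on every pass, multiply conditionally, and finish with a multiplication by 1) can be sketched in Python. This is a software model under stated assumptions, not a register-accurate description: `mont` and `modexp_msb_first` are hypothetical names, s is taken as n+2, and the register transfers of steps c.3) and c.4) are collapsed into ordinary variable assignments.

```python
def mont(a: int, b: int, n_mod: int, s: int) -> int:
    """Non-reduced, extended Montgomery product: result * 2**s = a*b (mod n_mod)."""
    acc = 0
    for i in range(s):
        acc += ((a >> i) & 1) * b
        acc += (acc & 1) * n_mod
        acc >>= 1
    return acc


def modexp_msb_first(a: int, e: int, n_mod: int) -> int:
    """Square-and-multiply over the exponent bits below the leading 1 (e >= 1)."""
    s = n_mod.bit_length() + 2             # assumed choice s = n + 2
    a_adj = (a << s) % n_mod               # a) A' = A * 2^s mod N
    low_bits = bin(e)[3:]                  # b) bits after the leading 1, MSB first
    x = a_adj
    for bit in low_bits:                   # m-1 passes in total
        x = mont(x, x, n_mod, s)           # squaring
        if bit == "1":
            x = mont(x, a_adj, n_mod, s)   # conditional multiplication by A'
    return mont(x, 1, n_mod, s)            # d) multiplication by 1 removes the 2^s factor
```

Because every intermediate value stays below 2N, no reduction is needed inside the loop; only the final multiplication by 1 brings the value back into the range 0 to N.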

13. A method according to claim 9, further comprising carrying out modular
exponentiation A^E mod N, by performing the following steps:

a) pre-calculating the adjusted operand value A' = A * 2^s mod N;

b) loading the content of the first (R2), second (R0), third (R1), and
fifth (R4) storage devices with the values of the modulus N, said adjusted
operand (A'), the sum of the modulus and the adjusted operand (N + A'),
and the exponent value E, obtaining the bit length m of said exponent
value E, setting a flag to "1", and performing the following steps:

b.1) right shifting the content of said fifth storage device (R4);

b.2) if the least significant bit (R4_0) of said fifth storage device equals
"1" checking the state of said flag, and if it does not equal "1" performing
non-reduced and extended Montgomery multiplication to obtain the
non-reduced and extended Montgomery multiplication result of the contents
of said second storage device (R0) and said fourth storage device (R3), in
said accumulating device, loading the content of said fourth storage
device (R3) with the content of said accumulating device, otherwise
loading the content of said fourth storage device (R3) with the content of
said second storage device (R0) and resetting the state of said flag to "0";

b.3) performing extended and non-reduced Montgomery squaring to
obtain the extended and non-reduced Montgomery square of the content
of said second storage device (R0) in the accumulating device;

b.4) loading the content of said second storage device (R0) with the
content of said accumulating device;

b.5) loading the content of said third storage device (R1) with the sum
of the content of said first storage device and the content of said
accumulating device;

b.6) repeating steps b.1) to b.5) m-1 additional times; and

c) performing extended and non-reduced Montgomery multiplication
to obtain the extended and non-reduced Montgomery multiplication result
of the contents of said second storage device (R0) and said fourth storage
device (R3), in said accumulating device, loading the content of said second
storage device (R0) with the content of said accumulating device, loading
the content of said third storage device (R1) with the sum of the content of
said first storage device (R2) and the content of said accumulating device,
and performing extended and non-reduced Montgomery multiplication of
the content of said second storage device (R0) by 1 to obtain the final
reduced result in said accumulating device.
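
The flag-based scheme, which scans the exponent from its least significant bit, can be modelled along the following lines. This Python sketch is one possible reading and rests on several assumptions: the names are hypothetical, s is taken as n+2, the loop covers only the m-1 low-order bits, the closing multiplication accounts for the leading 1 bit of E, and the case in which no low-order bit is set is handled explicitly.

```python
def mont(a: int, b: int, n_mod: int, s: int) -> int:
    """Non-reduced, extended Montgomery product: result * 2**s = a*b (mod n_mod)."""
    acc = 0
    for i in range(s):
        acc += ((a >> i) & 1) * b
        acc += (acc & 1) * n_mod
        acc >>= 1
    return acc


def modexp_lsb_first(a: int, e: int, n_mod: int) -> int:
    """Right-to-left exponentiation with a first-use flag (e >= 1)."""
    s = n_mod.bit_length() + 2           # assumed choice s = n + 2
    square = (a << s) % n_mod            # plays the role of R0: A' = A * 2^s mod N
    partial = None                       # plays the role of R3, unset while flag == 1
    flag = 1
    m = e.bit_length()
    for i in range(m - 1):               # low-order exponent bits, LSB first
        if (e >> i) & 1:
            if flag:
                partial, flag = square, 0                 # first set bit: copy, clear flag
            else:
                partial = mont(square, partial, n_mod, s)
        square = mont(square, square, n_mod, s)           # squaring on every pass
    # The leading exponent bit is always 1 (assumed to be handled by the closing step).
    result = square if flag else mont(square, partial, n_mod, s)
    return mont(result, 1, n_mod, s)     # multiplication by 1 removes the 2^s factor
```

The flag avoids an initial multiplication by the Montgomery representation of 1: the first time a set bit is met, the current square is simply copied.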

14. A method according to claim 9, further comprising carrying out modular
multiplication of a first (A = A^1 * 2^n + A^0) and a second (B = B^1 * 2^n + B^0) integer
values, where said first integer, second integer, and the modulus (N), are of 2xn
bits, by performing the following steps:

a) computing the Montgomery multiplication (MMUL(A^0, B^0)) of the n
least significant bits of said first integer value (A^0) and of said second
integer value (B^0), by performing the following steps:

a.1) loading the first (R2), second (R0), third (R1), and fourth (R3)
storage devices, with the n least significant bits (N^0) of said modulus
value (N), the n least significant bits (B^0) of said second integer value
(B), the sum (B^0 + N^0) of the n least significant bits of said modulus
value (N) and of the n least significant bits (B^0) of said second integer
value (B), and the n least significant bits (A^0) of said first integer value
(A), respectively;

a.2) setting the first (MX1), second (MX2), third (MX3), and fourth
(MX4) arbitration devices for selecting the input of the circuitry for
producing the state (K_i) of the second control input (C0), the circuitry
for producing the state (K_i) of the second control input (C0), the zero
value ("0"), and the fourth storage device (R3) input, and resetting the
content of the accumulating device to zero, if it is required;

a.3) carrying out Montgomery multiplication and obtaining the result
(S^(I)) in said accumulating device, and the bit states (K_i, 0 ≤ i ≤ n-1) of
the second control input (K^0) in the fifth register (R4);

b) computing the value of A^0 * B^1 + N^1 * K^0 + S^(I) of the n least
significant bits of said first integer value (A^0), the n most significant bits of
said second integer value (B^1), the n most significant bits of said modulus
value (N^1), the n-bit value (K^0) obtained in the fifth register (R4), and the
result obtained in step a) (S^(I)) by performing the following steps:

b.1) loading the first (R2), second (R0), third (R1), and fourth (R3)
storage devices, with the n most significant bits (N^1) of said modulus
value (N), the n most significant bits (B^1) of said second integer value
(B), the sum (B^1 + N^1) of the n most significant bits of said modulus
value (N) and of the n most significant bits of said second integer value
(B), and the n least significant bits (A^0) of said first integer value (A),
respectively;

b.2) setting the first (MX1), second (MX2), third (MX3), and fourth
(MX4) arbitration devices for selecting the input of said fifth register
(R4), the least significant bit of said accumulating device (S_0), the zero
value ("0"), and the fourth storage device (R3) input;

b.3) carrying out the computation and obtaining the most significant
bits of the result in said accumulating device (S^(II)) and the least
significant bits of said result in said fifth storage device (R4^(II));

c) computing the result of addition of the Montgomery multiplication of
the n most significant bits of said first integer value (A^1) and the n least
significant bits of said second integer value (B^0), with the result obtained
in step b) (R4^(II), S^(II)), by performing the following steps:

c.1) loading the first (R2), second (R0), third (R1), and fourth (R3)
storage devices, with the n least significant bits (N^0) of said modulus
value (N), the n least significant bits (B^0) of said second integer value
(B), the sum (B^0 + N^0) of the n least significant bits of said modulus
value (N) and of the n least significant bits (B^0) of said second integer
value (B), and the n most significant bits (A^1) of said first integer value
(A), respectively;

c.2) loading the content of the accumulating device (S) with the n least
significant bits of the result obtained in step b) (R4^(II)), and loading
the content of said fifth storage device (R4) with the n most significant bits
of the result obtained in step b) (S^(II));

c.3) setting the first (MX1), second (MX2), third (MX3), and fourth
(MX4) arbitration devices for selecting the input of the circuitry for
producing the state (K_i) of the second control input (C0), the circuitry
for producing the state (K_i) of the second control input (C0), the input
from the fifth storage device (R4), and the fourth storage device (R3)
input;

c.4) carrying out Montgomery multiplication and obtaining the result
(S^(III)) in said accumulating device, and the bit states (K_i, 0 ≤ i ≤ n-1)
of the second control input (K^1) in the fifth register (R4);
d) computing A^1 * B^1 + N^1 * K^1 + S^(III) of the n most significant bits of
said first integer value (A^1), the n most significant bits of said second
integer value (B^1), the n most significant bits of said modulus value (N^1),
the n-bit value (K^1) obtained in the fifth register (R4), and the result
obtained in step c) (S^(III)) by performing the following steps:

d.1) loading the first (R2), second (R0), third (R1), and fourth (R3)
storage devices, with the n most significant bits (N^1) of said modulus
value (N), the n most significant bits (B^1) of said second integer value
(B), the sum (B^1 + N^1) of the n most significant bits of said modulus
value (N) and of the n most significant bits of said second integer value
(B), and the n most significant bits (A^1) of said first integer value (A),
respectively;

d.2) setting the first (MX1), second (MX2), third (MX3), and fourth
(MX4) arbitration devices for selecting the input of said fifth register
(R4), the least significant bit of said accumulating device (S_0), the zero
value ("0"), and the fourth storage device (R3) input; and

d.3) carrying out the computation and obtaining the most significant
bits of the result in said accumulating device (S^(IV)) and the least
significant bits of said result in said fifth storage device (R4^(IV)).
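
The arithmetic behind steps a) to d) can be followed in a short Python model. It is a sketch under simplifying assumptions: the names `mont_pass` and `interleaved_montgomery` are hypothetical, n iterations are used per Montgomery pass, and each intermediate result is kept in one Python integer instead of being split between the accumulating device and R4.

```python
def mont_pass(acc: int, a_half: int, b_low: int, n_low: int, n: int):
    """Run n bit-serial Montgomery iterations starting from acc.

    Returns the halved-down accumulator and the recorded K bits.
    """
    k = 0
    for i in range(n):
        acc += ((a_half >> i) & 1) * b_low
        k_i = acc & 1                  # decides whether the low modulus half is added
        acc += k_i * n_low
        acc >>= 1
        k |= k_i << i                  # K bits are stored LSB first
    return acc, k


def interleaved_montgomery(a: int, b: int, n_mod: int, n: int):
    """2n-bit Montgomery product assembled from n-bit halves (steps a to d)."""
    mask = (1 << n) - 1
    a0, a1 = a & mask, a >> n
    b0, b1 = b & mask, b >> n
    n0, n1 = n_mod & mask, n_mod >> n
    s1, k0 = mont_pass(0, a0, b0, n0, n)       # a) MMUL(A^0, B^0), K^0 recorded
    t2 = a0 * b1 + n1 * k0 + s1                # b) A^0*B^1 + N^1*K^0 + S^(I)
    s3, k1 = mont_pass(t2, a1, b0, n0, n)      # c) pass over A^1 seeded with step b)
    s4 = a1 * b1 + n1 * k1 + s3                # d) A^1*B^1 + N^1*K^1 + S^(III)
    return s4, k0 | (k1 << n)


# Illustrative values (assumptions): n = 8, so operands and modulus have 16 bits.
N = 0xC35B
A, B = 0x9A3C, 0x7E11
S4, K = interleaved_montgomery(A, B, N, 8)
```

In this model the final value satisfies S4 * 2^(2n) = A*B + K*N exactly, with K = K^1 * 2^n + K^0, which is the defining relation of a 2n-bit Montgomery product.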

15. A method according to claim 14, further comprising carrying out modular
multiplication of a first (A = Σ_i A^i * 2^(i*n)) and a second (B = Σ_i B^i * 2^(i*n)) integer
values, where said first integer, second integer, and the modulus
may be of more than 2*n bits, where the computation is carried
out by computing intermediate results of the multiplication of 2xn bits
subsequent fractions of said first integer and second integer.

16. Apparatus for carrying out extended and non-reduced Montgomery
multiplication of a first (A) and second (B) integer values, in which the number
of iterations (s) required is greater than the number of bits (n) in the modulo value
(N), and in which the Montgomery multiplication result is smaller than twice
the modulo value (2xN), comprising:

a) a first storage device (R2) for storing the modulo value (N);

b) a second storage device (R0) for storing the value of said first
integer value (A);

c) a third storage device (R1) for storing the sum of said first integer
value and said modulo (A+N);

d) an arbitration circuitry having a first (In1), second (In2) and third
(In3) inputs from said first (R2), second (R0), and third (R1) storage
devices, and having a fourth input which is zero ("0"), said arbitration
device receives a first (C1) and a second (C0) control inputs, and thereby is
capable of selecting one of its other inputs as its output, that is attached to
the input of the accumulating device;

e) circuitry for producing the state (K_i) of said second control input
(C0) according to the state of a selected bit of said first integer value (A_i),
the state of the least significant bit of said second integer value (B_0), and
according to the state of the least significant bit of said accumulating
device (S_0); and

f) an accumulating device (S) capable of storing n+2-bit values, of
adding n+2-bit values (X) to its content (S + X → S), and of dividing its
content by 2 (S/2 → S).

17. Apparatus according to claim 16, in which the circuitry utilized for
producing the state (K_i) of the second control input comprises:

circuitry for producing a value of one whenever:

the state of the selected bit (A_i) and the state of the least
significant bit of the second integer value (B_0) are one, and the state of the least
significant bit of the accumulating device (S_0) is zero; or

the state of said selected bit (A_i) and the state of the least
significant bit (B_0) of said second integer value are in different states, and the
state of the least significant bit (S_0) of said accumulating device is one;
said circuitry produces a zero value in all other cases.

18. Apparatus according to claim 17, in which the first (R2), second (R0), and
third (R1) storage devices are n+2-bit shift registers having a serial input into
their most significant bit locations, and which may be capable of outputting
their content in parallel.

19. Apparatus according to claim 17, in which said first storage device (R2)
has a serial output, from its least significant bit location (R2_0), allowing it to
perform cyclic bit rotation.

20. Apparatus according to claims 17, 18, and 19, further including means for
allowing modular arithmetic operations to be carried out, that comprises:

a) means for connecting the serial input of the second storage device
(R0) to the least significant bit (S_0) of the accumulating device (S);

b) a fourth storage device (R3) capable of serially outputting its
content, bit by bit in sequence (R3_i, i = 0, 1, 2, ..., n+1), starting from its least
significant bit (R3_0), said fourth storage device is capable of storing n+2
bits, and of performing cyclic bit rotation of its content;

c) a fifth storage device (R4) having a serial input and a serial output,
and which is capable of storing values of n+2 bits;

d) a sixth storage device (R5) capable of serially outputting its content,
bit by bit in sequence (R5_i, i = 0, 1, 2, ..., n+1), starting from its least
significant bit, said fourth storage device is capable of storing n+2 bits;

e) a first arbitration device (MX1) having a first input from said fifth
storage device (R4_i), and a second input from the circuitry producing the
state of the second control input (K_i), the output of said first arbitration
device is attached to the second control input (C0);

f) a second arbitration device (MX2) having a first input being equal
to the least significant bit of the accumulating device (S_0), a second input
received from the output of said circuitry (K_i), and a third input connected
to the serial output (R4_i) of said fifth storage device (R4), the output of
said second arbitration device is attached to the serial input of said fifth
storage device (R4);

g) a third arbitration device (MX3) having a first input which is
constantly fed with a zero value ("0"), and a second input received from the
serial output of said fifth storage device (R4_i), the output of said third
arbitration device is connected to a serial input of said accumulating
device;

h) a fourth arbitration device (MX4) having a first input connected to
the serial output of said sixth storage device (R5_i), and a second input
connected to the serial output of said fourth storage device (R3_i), the
output of said fourth arbitration device is connected to the first control
input (C1); and

i) an adder capable of performing serial addition of n+2-bit values, said
adder receives a first input from the least significant bit location of the
accumulating device (S_0), and a second input from the serial output of the
first storage device (R2), the output of said adder is connected to the serial
input of the third storage device (R1).

21. Apparatus according to claim 20, in which the accumulating device consists
of n+2 addition and latching stages, each of which consists of a first and a
second flip flop devices and a full adder device having three inputs, except for
the first stage wherein said second flip flop is excluded, comprising:

a) means for connecting the first input of said full adder to the
output of a first flip flop device;

b) means for connecting the second input of said full adder to
the output of a second flip flop device of the subsequent addition and
latching stage; and

c) means for connecting the third input of said full adder to the
respective bit output of the arbitration device (MUX_i, 0 ≤ i ≤ n+1).

22. Apparatus according to claim 21, further including means for adding the
output from the third arbitration device (MX3), via the serial input of said
accumulating device, to the addition result of the (n+1)-th addition and latching
stage, that comprises:

a) a first and second half adder devices, and a third flip flop
device;

b) means for connecting the input of the first flip flop device to
the sum output of said second half adder;

c) means for connecting the input of the second flip flop device
to the carry output of said second half adder, and for connecting the
output of said flip flop device to the second input of the full adder of
the (n+2)-th addition and latching stage;

d) means for connecting the first input of said second half
adder to the carry output of the full adder of the (n+1)-th addition
and latching stage, and its second input to the carry output of said
first half adder;

e) means for connecting the first input of said first half adder
to the sum output of said full adder, and for connecting the second
input of said second half adder to the output of the third arbitration
device (MX3); and

f) means for connecting the input of said third flip flop device
to the sum output of said first half adder, and connecting its output to
the second input of the full adder of the (n-1)-th addition and
latching stage.

23. Apparatus according to claims 17 and 22, in which the state of the second
control input (C0) is determined utilizing the least significant bit of the second
storage device (R0), the output of the fourth arbitration device (MX4), the carry
output of the full adder of the first addition and latching stage, and the sum
output of the full adder of the second addition and latching stage, comprising:

a) means for connecting the least significant bit of said second
storage device (R0) and the output of said fourth arbitration device
(MX4), to the inputs of an AND logical gate;

b) an additional half adder and an additional flip flop device;

c) means for connecting the first input of said half adder to the
sum output of the full adder of the second addition and latching
stage, and its second input to the carry output of the full adder of
the first addition and latching stage;

d) means for connecting the sum output of said half adder to
the input of said additional flip flop device; and

e) means for connecting the output of said AND logical gate
and the output of said flip flop device to the inputs of a XOR gate,
and utilizing the output of said XOR gate to determine the state of
said second control input (C0).